Model-Based Feature Selection Using Structural Equation Modeling for Enhanced Classification Performance in High-Dimensional Datasets
| dc.contributor.author | Albayrak, Muammer | |
| dc.contributor.author | Turhan, Kemal | |
| dc.date.accessioned | 2025-11-03T17:01:05Z | |
| dc.date.available | 2025-11-03T17:01:05Z | |
| dc.date.issued | 2025-09-01 | |
| dc.description.abstract | Feature selection is increasingly important in machine learning and data mining. For high-dimensional datasets in particular, irrelevant and redundant features must be filtered out to mitigate overfitting and the problems associated with high dimensionality. We hypothesized that effective feature selection can be performed with a model-based approach using Structural Equation Modeling (SEM). The dataset consists of 2969 samples and 117 features. First, a measurement model was built and tested with confirmatory factor analysis (CFA), and the number of features was reduced to 58 by removing statistically insignificant features. In the SEM analysis, feature subsets of 55, 52, 41, and 35 features were obtained by removing variables whose standardized regression coefficient (SRC) fell below the chosen threshold values. The resulting feature subsets were evaluated with a multilayer perceptron (MLP), and their effect on classification performance was examined. Results were compared with random forest feature importance as the baseline method. SEM and random forest generally performed very similarly. Feature subsets created with random forest produced better results in two-class classification, whereas those created with the proposed SEM-based method performed better in three- and five-class classification. These results show that effective feature selection can be achieved with the proposed model-based approach using SEM. Because expert knowledge is incorporated into the modeling process, this approach can yield feature subsets that form a model which is statistically significant and consistent with domain knowledge. | en_US |
| dc.description.sponsorship | The data for this study was obtained from the dbGaP (Database of Genotypes and Phenotypes) database, which is maintained by the National Center for Biotechnology Information (NCBI). This database contains a wealth of genotype and phenotype data sets from various research projects. The first dataset utilized in this analysis is titled "CIDR, NCI, NIDA Sequencing of Targeted Genomic Regions Associated with Smoking". This dataset includes samples from two major projects: the Collaborative Genetic Study of Nicotine Addiction (COGEND) and the Genetic Study of Nicotine Addiction in African Americans (AAND). In total, the dataset contains 2969 samples with 117 variables (dbGaP Study Accession: phs000813.v1.p1). The second dataset is titled "Study of Addiction: Genetics and Environment (SAGE)". This study is part of the Gene Environment Association Studies initiative (GENEVA) funded by the National Human Genome Research Institute. In total, the dataset contains 3847 samples with 124 variables (dbGaP Study Accession: phs000092.v1.p1). The data from these two datasets includes a range of measures related to nicotine, alcohol, marijuana, cocaine, and opiate usage, such as scales for nicotine dependence (FTND, WISDM, NDSS) and substance abuse symptoms. Additionally, physical characteristics (age, height, weight) and socioeconomic status variables (income, education level) are provided. | |
| dc.description.sponsorship | GENEVA; National Human Genome Research Institute, NCHGR | |
| dc.identifier.doi | 10.35378/gujs.1507978 | |
| dc.identifier.issn | 2147-1762 | |
| dc.identifier.scopus | 2-s2.0-105018453663 | |
| dc.identifier.uri | https://doi.org/10.35378/gujs.1507978 | |
| dc.identifier.uri | https://hdl.handle.net/20.500.14365/6535 | |
| dc.identifier.uri | https://search.trdizin.gov.tr/en/yayin/detay/1351593/model-based-feature-selection-using-structural-equation-modeling-for-enhanced-classification-performance-in-high-dimensional-datasets | |
| dc.identifier.uri | https://search.trdizin.gov.tr/en/yayin/detay/1351593 | |
| dc.language.iso | en | en_US |
| dc.publisher | Gazi University | en_US |
| dc.relation.ispartof | Gazi University Journal of Science | en_US |
| dc.rights | info:eu-repo/semantics/openAccess | en_US |
| dc.subject | Artificial Neural Networks | en_US |
| dc.subject | Feature Importance | en_US |
| dc.subject | Feature Selection | en_US |
| dc.subject | Structural Equation Modeling | en_US |
| dc.subject | Computer Science, Artificial Intelligence | |
| dc.title | Model-Based Feature Selection Using Structural Equation Modeling for Enhanced Classification Performance in High-Dimensional Datasets | en_US |
| dc.type | Article | en_US |
| dspace.entity.type | Publication | |
| gdc.author.id | 0000-0001-7871-3025 | |
| gdc.author.id | 0000-0002-5946-6310 | |
| gdc.author.institutional | Turhan, Kutsal | |
| gdc.author.scopusid | 14628008200 | |
| gdc.author.scopusid | 56023880500 | |
| gdc.author.wosid | turhan, kemal/B-8494-2011 | |
| gdc.author.wosid | Albayrak, Muammer/R-3323-2019 | |
| gdc.bip.impulseclass | C5 | |
| gdc.bip.influenceclass | C5 | |
| gdc.bip.popularityclass | C5 | |
| gdc.coar.access | open access | |
| gdc.coar.type | text::journal::journal article | |
| gdc.collaboration.industrial | false | |
| gdc.description.department | İEÜ, Faculty of Medicine, Department of Basic Medical Sciences | en_US |
| gdc.description.departmenttemp | Karadeniz Teknik Üniversitesi, İzmir Ekonomi Üniversitesi | en_US |
| gdc.description.endpage | 1260 | en_US |
| gdc.description.issue | 3 | en_US |
| gdc.description.publicationcategory | Article - National Refereed Journal - Institutional Faculty Member | en_US |
| gdc.description.scopusquality | Q3 | |
| gdc.description.startpage | 1247 | en_US |
| gdc.description.volume | 38 | en_US |
| gdc.description.woscitationindex | Emerging Sources Citation Index | |
| gdc.description.wosquality | Q3 | |
| gdc.identifier.openalex | W4412758337 | |
| gdc.identifier.trdizinid | 1351593 | |
| gdc.identifier.wos | WOS:001576896900006 | |
| gdc.index.type | WoS | |
| gdc.index.type | Scopus | |
| gdc.index.type | TR-Dizin | |
| gdc.oaire.accesstype | GOLD | |
| gdc.oaire.diamondjournal | false | |
| gdc.oaire.impulse | 0.0 | |
| gdc.oaire.influence | 2.4895952E-9 | |
| gdc.oaire.isgreen | false | |
| gdc.oaire.popularity | 2.7494755E-9 | |
| gdc.oaire.publicfunded | false | |
| gdc.openalex.collaboration | National | |
| gdc.openalex.fwci | 0.0 | |
| gdc.openalex.normalizedpercentile | 0.22 | |
| gdc.openalex.toppercent | TOP 10% | |
| gdc.opencitations.count | 0 | |
| gdc.plumx.scopuscites | 0 | |
| gdc.scopus.citedcount | 0 | |
| gdc.virtual.author | Turhan, Kutsal | |
| gdc.virtual.author | Turhan, Kemal | |
| gdc.wos.citedcount | 0 | |
| relation.isAuthorOfPublication | 8d56352f-325d-4037-8180-9eafadecc821 | |
| relation.isAuthorOfPublication | 0af17f77-3168-4aaf-a45e-2cbf889758a7 | |
| relation.isAuthorOfPublication.latestForDiscovery | 8d56352f-325d-4037-8180-9eafadecc821 | |
| relation.isOrgUnitOfPublication | fb5b4042-739a-4880-8e29-a9adb71d6492 | |
| relation.isOrgUnitOfPublication | fbc53f3e-d1d3-4168-afd8-e42cd20bddd9 | |
| relation.isOrgUnitOfPublication | e9e77e3e-bc94-40a7-9b24-b807b2cd0319 | |
| relation.isOrgUnitOfPublication | 4cbb0a74-ee1a-438b-b714-b8ef253df94b | |
| relation.isOrgUnitOfPublication.latestForDiscovery | fb5b4042-739a-4880-8e29-a9adb71d6492 |
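
The abstract describes deriving progressively smaller feature subsets by dropping variables whose standardized regression coefficient (SRC) falls below a threshold. A minimal sketch of that filtering step follows. The feature names, coefficient values, thresholds, and the use of the absolute value of the SRC are all assumptions made for illustration; the record does not state them, and this is not the authors' implementation.

```python
from typing import Dict, List


def select_by_src(src: Dict[str, float], threshold: float) -> List[str]:
    """Keep features whose absolute SRC meets or exceeds the threshold.

    Comparing on the absolute value is an assumption, so that strong
    negative relationships are retained as well as strong positive ones.
    """
    return [name for name, coef in src.items() if abs(coef) >= threshold]


# Hypothetical SRC values for illustration only; not taken from the study.
example_src = {
    "ftnd_score": 0.62,
    "wisdm_score": 0.48,
    "age": -0.21,
    "income": 0.09,
    "height": 0.03,
}

# Hypothetical thresholds; the abstract does not report the values used.
subsets = {t: select_by_src(example_src, t) for t in (0.05, 0.10, 0.30)}

for threshold, features in subsets.items():
    print(f"SRC >= {threshold:.2f}: {len(features)} features -> {features}")
```

Raising the threshold yields nested subsets, which mirrors the shrinking sequence of subset sizes (55, 52, 41, 35) reported in the abstract. Each subset could then be passed to a classifier such as an MLP for comparison.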
