Model-Based Feature Selection Using Structural Equation Modeling for Enhanced Classification Performance in High-Dimensional Datasets

dc.contributor.author Albayrak, Muammer
dc.contributor.author Turhan, Kemal
dc.date.accessioned 2025-11-03T17:01:05Z
dc.date.available 2025-11-03T17:01:05Z
dc.date.issued 2025-09-01
dc.description.abstract Feature selection is increasingly important in machine learning and data mining. For high-dimensional datasets in particular, irrelevant and unnecessary features must be filtered out to mitigate overfitting and the problems of high dimensionality. We hypothesized that effective feature selection can be performed with a model-based approach using Structural Equation Modeling (SEM). The dataset consists of 2969 samples and 117 features. First, a measurement model was built and tested with confirmatory factor analysis (CFA), and the number of features was reduced to 58 by removing the statistically insignificant ones. In the SEM analysis, sub-feature sets of 55, 52, 41 and 35 features were obtained by removing variables whose standardized regression coefficient (SRC) fell below the chosen threshold values. The resulting sub-feature sets were evaluated with a multilayer perceptron (MLP) to examine their effect on classification performance. Results were compared with random forest feature importance as a baseline method. SEM and random forest generally performed very similarly. Sub-feature sets created with random forest produced better results in two-class classification, whereas those created with the proposed SEM-based method performed better in three- and five-class classification. These results show that effective feature selection can be made with the proposed model-based approach using SEM. Because expert knowledge is included in the modeling process, this approach can yield sub-feature sets that form a model that is statistically significant and consistent with domain knowledge. en_US
dc.description.sponsorship The data for this study was obtained from the dbGaP (Database of Genotypes and Phenotypes) database, which is maintained by the National Center for Biotechnology Information (NCBI). This database contains a wealth of genotype and phenotype data sets from various research projects. The first dataset utilized in this analysis is titled "CIDR, NCI, NIDA Sequencing of Targeted Genomic Regions Associated with Smoking". This dataset includes samples from two major projects: the Collaborative Genetic Study of Nicotine Addiction (COGEND) and the Genetic Study of Nicotine Addiction in African Americans (AAND). In total, the dataset contains 2969 samples with 117 variables (dbGaP Study Accession: phs000813.v1.p1). The second dataset is titled "Study of Addiction: Genetics and Environment (SAGE)". This study is part of the Gene Environment Association Studies initiative (GENEVA) funded by the National Human Genome Research Institute. In total, the dataset contains 3847 samples with 124 variables (dbGaP Study Accession: phs000092.v1.p1). The data from these two datasets include a range of measures related to nicotine, alcohol, marijuana, cocaine, and opiate usage, such as scales for nicotine dependence (FTND, WISDM, NDSS) and substance abuse symptoms. Additionally, physical characteristics (age, height, weight) and socioeconomic status variables (income, education level) are provided.
dc.description.sponsorship GENEVA; National Human Genome Research Institute, NCHGR
dc.identifier.doi 10.35378/gujs.1507978
dc.identifier.issn 2147-1762
dc.identifier.scopus 2-s2.0-105018453663
dc.identifier.uri https://doi.org/10.35378/gujs.1507978
dc.identifier.uri https://hdl.handle.net/20.500.14365/6535
dc.identifier.uri https://search.trdizin.gov.tr/en/yayin/detay/1351593/model-based-feature-selection-using-structural-equation-modeling-for-enhanced-classification-performance-in-high-dimensional-datasets
dc.identifier.uri https://search.trdizin.gov.tr/en/yayin/detay/1351593
dc.language.iso en en_US
dc.publisher Gazi University en_US
dc.relation.ispartof Gazi University Journal of Science en_US
dc.rights info:eu-repo/semantics/openAccess en_US
dc.subject Artificial Neural Networks en_US
dc.subject Feature Importance en_US
dc.subject Feature Selection en_US
dc.subject Structural Equation Modeling en_US
dc.subject Computer Science, Artificial Intelligence
dc.title Model-Based Feature Selection Using Structural Equation Modeling for Enhanced Classification Performance in High-Dimensional Datasets en_US
dc.type Article en_US
dspace.entity.type Publication
gdc.author.id 0000-0001-7871-3025
gdc.author.id 0000-0002-5946-6310
gdc.author.institutional Turhan, Kutsal
gdc.author.scopusid 14628008200
gdc.author.scopusid 56023880500
gdc.author.wosid turhan, kemal/B-8494-2011
gdc.author.wosid Albayrak, Muammer/R-3323-2019
gdc.bip.impulseclass C5
gdc.bip.influenceclass C5
gdc.bip.popularityclass C5
gdc.coar.access open access
gdc.coar.type text::journal::journal article
gdc.collaboration.industrial false
gdc.description.department İEÜ, Faculty of Medicine, Department of Basic Medical Sciences en_US
gdc.description.departmenttemp Karadeniz Teknik Üniversitesi, İzmir Ekonomi Üniversitesi en_US
gdc.description.endpage 1260 en_US
gdc.description.issue 3 en_US
gdc.description.publicationcategory Article - National Peer-Reviewed Journal - Institutional Faculty Member en_US
gdc.description.scopusquality Q3
gdc.description.startpage 1247 en_US
gdc.description.volume 38 en_US
gdc.description.woscitationindex Emerging Sources Citation Index
gdc.description.wosquality Q3
gdc.identifier.openalex W4412758337
gdc.identifier.trdizinid 1351593
gdc.identifier.wos WOS:001576896900006
gdc.index.type WoS
gdc.index.type Scopus
gdc.index.type TR-Dizin
gdc.oaire.accesstype GOLD
gdc.oaire.diamondjournal false
gdc.oaire.impulse 0.0
gdc.oaire.influence 2.4895952E-9
gdc.oaire.isgreen false
gdc.oaire.popularity 2.7494755E-9
gdc.oaire.publicfunded false
gdc.openalex.collaboration National
gdc.openalex.fwci 0.0
gdc.openalex.normalizedpercentile 0.22
gdc.openalex.toppercent TOP 10%
gdc.opencitations.count 0
gdc.plumx.scopuscites 0
gdc.scopus.citedcount 0
gdc.virtual.author Turhan, Kutsal
gdc.virtual.author Turhan, Kemal
gdc.wos.citedcount 0
relation.isAuthorOfPublication 8d56352f-325d-4037-8180-9eafadecc821
relation.isAuthorOfPublication 0af17f77-3168-4aaf-a45e-2cbf889758a7
relation.isAuthorOfPublication.latestForDiscovery 8d56352f-325d-4037-8180-9eafadecc821
relation.isOrgUnitOfPublication fb5b4042-739a-4880-8e29-a9adb71d6492
relation.isOrgUnitOfPublication fbc53f3e-d1d3-4168-afd8-e42cd20bddd9
relation.isOrgUnitOfPublication e9e77e3e-bc94-40a7-9b24-b807b2cd0319
relation.isOrgUnitOfPublication 4cbb0a74-ee1a-438b-b714-b8ef253df94b
relation.isOrgUnitOfPublication.latestForDiscovery fb5b4042-739a-4880-8e29-a9adb71d6492
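
The SRC-threshold step described in the abstract can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the feature names, coefficient values, and thresholds are hypothetical, and comparing the absolute value of each coefficient against the threshold is an assumption, since the abstract does not state how negative coefficients were handled.

```python
from typing import Dict, Iterable, List


def select_by_src(coefficients: Dict[str, float], threshold: float) -> List[str]:
    """Keep features whose standardized regression coefficient (SRC)
    is at least `threshold` in absolute value (assumed rule)."""
    return [name for name, beta in coefficients.items() if abs(beta) >= threshold]


def nested_subsets(
    coefficients: Dict[str, float], thresholds: Iterable[float]
) -> Dict[float, List[str]]:
    """Build one sub-feature set per threshold, from loosest to strictest."""
    return {t: select_by_src(coefficients, t) for t in sorted(thresholds)}


# Hypothetical SRC values, standing in for estimates from a fitted SEM.
src = {
    "ftnd_total": 0.62,
    "wisdm_total": -0.35,
    "age": 0.18,
    "height": 0.07,
    "income": 0.41,
}

# Hypothetical thresholds; the abstract does not list the values used.
subsets = nested_subsets(src, [0.1, 0.2, 0.4])

for threshold, features in subsets.items():
    print(f"SRC >= {threshold}: {len(features)} features -> {features}")
```

Each resulting subset could then be passed to a classifier such as an MLP, and compared against subsets of the same size chosen by random forest feature importance, which is the comparison the abstract reports.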
