Model-Based Feature Selection Using Structural Equation Modeling for Enhanced Classification Performance in High-Dimensional Datasets

Albayrak, Muammer; Turhan, Kemal

doi:10.35378/gujs.1507978

dc.contributor.author	Albayrak, Muammer
dc.contributor.author	Turhan, Kemal
dc.date.accessioned	2025-11-03T17:01:05Z
dc.date.available	2025-11-03T17:01:05Z
dc.date.issued	2025-09-01
dc.description.abstract	Feature selection is increasingly important in machine learning and data mining. For high-dimensional datasets in particular, irrelevant and unnecessary features must be filtered out to mitigate overfitting and the problems of high dimensionality. We hypothesized that effective feature selection can be achieved with a model-based approach using Structural Equation Modeling (SEM). The dataset consists of 2969 samples and 117 features. First, a measurement model was tested with confirmatory factor analysis (CFA), and the number of features was reduced to 58 by removing the statistically insignificant ones. In the SEM analysis, feature subsets of 55, 52, 41, and 35 features were obtained by removing variables whose standardized regression coefficient (SRC) fell below the chosen threshold values. The resulting feature subsets were tested with a multilayer perceptron (MLP), and their effect on performance was examined. Results were compared with random forest feature importance as the baseline method. SEM and random forest generally performed very similarly. Feature subsets created with random forest produced better results in two-class classification, whereas those created with the proposed SEM-based method performed better in three- and five-class classification. These results show that effective feature selection can be achieved with the proposed model-based approach using SEM. By including expert knowledge in the modeling process, this approach makes it possible to obtain feature subsets that form a model that is statistically significant and consistent with domain knowledge.	en_US
dc.description.sponsorship	The data for this study was obtained from the dbGaP (Database of Genotypes and Phenotypes) database, which is maintained by the National Center for Biotechnology Information (NCBI). This database contains a wealth of genotype and phenotype datasets from various research projects. The first dataset utilized in this analysis is titled "CIDR, NCI, NIDA Sequencing of Targeted Genomic Regions Associated with Smoking". This dataset includes samples from two major projects: the Collaborative Genetic Study of Nicotine Addiction (COGEND) and the Genetic Study of Nicotine Addiction in African Americans (AAND). In total, the dataset contains 2969 samples with 117 variables (dbGaP Study Accession: phs000813.v1.p1). The second dataset is titled "Study of Addiction: Genetics and Environment (SAGE)". This study is part of the Gene Environment Association Studies initiative (GENEVA) funded by the National Human Genome Research Institute. In total, the dataset contains 3847 samples with 124 variables (dbGaP Study Accession: phs000092.v1.p1). The data from these two datasets include a range of measures related to nicotine, alcohol, marijuana, cocaine, and opiate usage, such as scales for nicotine dependence (FTND, WISDM, NDSS) and substance abuse symptoms. Additionally, physical characteristics (age, height, weight) and socioeconomic status variables (income, education level) are provided.
dc.description.sponsorship	GENEVA; National Human Genome Research Institute, NCHGR
dc.identifier.doi	10.35378/gujs.1507978
dc.identifier.issn	2147-1762
dc.identifier.scopus	2-s2.0-105018453663
dc.identifier.uri	https://doi.org/10.35378/gujs.1507978
dc.identifier.uri	https://hdl.handle.net/20.500.14365/6535
dc.identifier.uri	https://search.trdizin.gov.tr/en/yayin/detay/1351593/model-based-feature-selection-using-structural-equation-modeling-for-enhanced-classification-performance-in-high-dimensional-datasets
dc.identifier.uri	https://search.trdizin.gov.tr/en/yayin/detay/1351593
dc.language.iso	en	en_US
dc.publisher	Gazi University	en_US
dc.relation.ispartof	Gazi University Journal of Science	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	Artificial Neural Networks	en_US
dc.subject	Feature Importance	en_US
dc.subject	Feature Selection	en_US
dc.subject	Structural Equation Modeling	en_US
dc.subject	Computer Science, Artificial Intelligence
dc.title	Model-Based Feature Selection Using Structural Equation Modeling for Enhanced Classification Performance in High-Dimensional Datasets	en_US
dc.type	Article	en_US
dspace.entity.type	Publication
gdc.author.id	0000-0001-7871-3025
gdc.author.id	0000-0002-5946-6310
gdc.author.institutional	Turhan, Kutsal
gdc.author.scopusid	14628008200
gdc.author.scopusid	56023880500
gdc.author.wosid	turhan, kemal/B-8494-2011
gdc.author.wosid	Albayrak, Muammer/R-3323-2019
gdc.bip.impulseclass	C5
gdc.bip.influenceclass	C5
gdc.bip.popularityclass	C5
gdc.coar.access	open access
gdc.coar.type	text::journal::journal article
gdc.collaboration.industrial	false
gdc.description.department	İEÜ, Faculty of Medicine, Department of Basic Medical Sciences	en_US
gdc.description.departmenttemp	Karadeniz Teknik Üniversitesi, İzmir Ekonomi Üniversitesi	en_US
gdc.description.endpage	1260	en_US
gdc.description.issue	3	en_US
gdc.description.publicationcategory	Article - National Refereed Journal - Institutional Faculty Member	en_US
gdc.description.scopusquality	Q3
gdc.description.startpage	1247	en_US
gdc.description.volume	38	en_US
gdc.description.woscitationindex	Emerging Sources Citation Index
gdc.description.wosquality	Q3
gdc.identifier.openalex	W4412758337
gdc.identifier.trdizinid	1351593
gdc.identifier.wos	WOS:001576896900006
gdc.index.type	WoS
gdc.index.type	Scopus
gdc.index.type	TR-Dizin
gdc.oaire.accesstype	GOLD
gdc.oaire.diamondjournal	false
gdc.oaire.impulse	0.0
gdc.oaire.influence	2.4895952E-9
gdc.oaire.isgreen	false
gdc.oaire.popularity	2.7494755E-9
gdc.oaire.publicfunded	false
gdc.openalex.collaboration	National
gdc.openalex.fwci	0.0
gdc.openalex.normalizedpercentile	0.22
gdc.openalex.toppercent	TOP 10%
gdc.opencitations.count	0
gdc.plumx.scopuscites	0
gdc.scopus.citedcount	0
gdc.wos.citedcount	0
relation.isAuthorOfPublication.latestForDiscovery	8d56352f-325d-4037-8180-9eafadecc821
relation.isOrgUnitOfPublication.latestForDiscovery	fb5b4042-739a-4880-8e29-a9adb71d6492
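
The abstract describes building progressively smaller feature subsets by dropping variables whose standardized regression coefficient (SRC) falls below a threshold. The following is a minimal sketch of that filtering step only, not the authors' implementation. The function names, the example coefficients, and the threshold values are all hypothetical, and comparing on the absolute value of the coefficient is an assumption; the record does not state the thresholds actually used.

```python
from typing import Dict, List


def select_by_src(src: Dict[str, float], threshold: float) -> List[str]:
    """Keep features whose absolute standardized coefficient meets the threshold.

    Using the absolute value is an assumption made for this sketch.
    """
    return [name for name, coef in src.items() if abs(coef) >= threshold]


def nested_subsets(src: Dict[str, float], thresholds: List[float]) -> Dict[float, List[str]]:
    """Build one feature subset per threshold; higher thresholds keep fewer features."""
    return {t: select_by_src(src, t) for t in sorted(thresholds)}


# Hypothetical coefficients for illustration only (not values from the study).
example_src = {"f1": 0.62, "f2": -0.35, "f3": 0.08, "f4": 0.21, "f5": -0.04}

# Hypothetical thresholds; the study's actual values are not given in this record.
subsets = nested_subsets(example_src, [0.05, 0.10, 0.30])

for t, feats in subsets.items():
    print(f"threshold={t:.2f}: {len(feats)} features -> {feats}")
```

In a workflow like the one the abstract outlines, each resulting subset would then be passed to a classifier (an MLP in the study) and compared against a baseline ranking such as random forest feature importance.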

Collections

TR Dizin İndeksli Yayınlar Koleksiyonu / TR Dizin Indexed Publications Collection
Scopus İndeksli Yayınlar Koleksiyonu / Scopus Indexed Publications Collection
WoS İndeksli Yayınlar Koleksiyonu / WoS Indexed Publications Collection
