Please use this identifier to cite or link to this item: https://hdl.handle.net/20.500.14365/3369
Title: Binary Text Representation for Feature Selection
Authors: Lang N.
Zincir I.
Zincir-Heywood N.
Keywords: Binary representation
Feature selection
Text classification
Classification (of information)
Information theory
NP-hard
Redundancy
Text processing
Binary representations
Empirical experiments
Feature selection methods
Information-theoretic approach
Maximum relevance minimum redundancies
Near-optimal solutions
Selection techniques
Theoretical approximations
Feature extraction
Publisher: Springer Science and Business Media Deutschland GmbH
Abstract: In many real-world applications, a large vocabulary can introduce noisy and redundant information that degrades the performance of text classification. Feature selection techniques that eliminate uninformative words have therefore been actively studied. In several information-theoretic approaches, features are conventionally selected by maximizing their relevance to the class while minimizing the redundancy among the selected features. This is an NP-hard problem and remains a challenge. In this work, we propose an alternative feature selection strategy for binary representation data, with the purpose of providing a theoretical lower bound for finding a near-optimal solution based on the Maximum Relevance-Minimum Redundancy criterion. With this strategy, a naive greedy search achieves a theoretical approximation ratio of 1/2. The proposed strategy is validated by empirical experiments on five publicly available datasets, namely Cora, Citeseer, WebKB, SMS Spam and Spambase. Its effectiveness is shown for binary text classification tasks in comparison with well-known filter feature selection methods and mutual information-based methods. © 2021, Springer Nature Switzerland AG.
Description: Future Technologies Conference, FTC 2020 -- 5 November 2020 through 6 November 2020 -- 251149
URI: https://doi.org/10.1007/978-3-030-63128-4_52
https://hdl.handle.net/20.500.14365/3369
ISBN: 9.78303E+12
ISSN: 2194-5357
Appears in Collections: Scopus İndeksli Yayınlar Koleksiyonu / Scopus Indexed Publications Collection
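
The abstract refers to greedy search under the Maximum Relevance-Minimum Redundancy criterion on binary features. As a rough illustration only, the sketch below applies a commonly used difference form of that criterion (relevance to the class minus mean redundancy with the features already chosen) to binary word-presence columns. It is not the strategy proposed in the paper and carries no approximation guarantee; the function names `mutual_information` and `greedy_mrmr` and the toy data are assumptions made for this example.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    px = Counter(x)
    py = Counter(y)
    mi = 0.0
    for (a, b), count in joint.items():
        # p(a,b) * log2(p(a,b) / (p(a) * p(b))), written with raw counts
        mi += (count / n) * log2(count * n / (px[a] * py[b]))
    return mi


def greedy_mrmr(columns, labels, k):
    """Greedily pick k feature indices by relevance minus mean redundancy.

    columns: list of binary feature columns (each a list of 0/1 values).
    labels:  list of class labels, same length as each column.
    Illustrative sketch only, not the paper's method.
    """
    relevance = [mutual_information(col, labels) for col in columns]
    selected = []
    remaining = set(range(len(columns)))
    while remaining and len(selected) < k:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = sum(
                mutual_information(columns[j], columns[s]) for s in selected
            ) / len(selected)
            return relevance[j] - redundancy
        # Ties go to the lowest index so the result is deterministic.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: word 0 tracks the label closely, word 1 is an
# exact copy of word 0, and word 2 is weakly related to the label.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = [
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 0, 0],
]
print(greedy_mrmr(columns, labels, 2))
```

On this toy data the duplicate word is passed over in the second step: it is as relevant as the first pick but fully redundant with it, so the weaker yet less redundant word scores higher.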

Files in This Item:
File: 3369.pdf
Size: 606.24 kB
Format: Adobe PDF
Access: Restricted

Items in GCRIS Repository are protected by copyright, with all rights reserved, unless otherwise indicated.