A Proposal for Corpus Normalization

dc.contributor.author Karaoglan, Bahar
dc.contributor.author Kisla, Tarik
dc.contributor.author Dincer, Bekir Taner
dc.contributor.author Metin, Senem Kumova
dc.date.accessioned 2023-06-16T14:50:29Z
dc.date.available 2023-06-16T14:50:29Z
dc.date.issued 2013
dc.description 21st Signal Processing and Communications Applications Conference (SIU) -- APR 24-26, 2013 -- CYPRUS en_US
dc.description.abstract In order to compare work in natural language processing, the corpora used in different studies should be standardized (normalized). Entropy, used as a performance metric for language models, depends entirely on signal information, whereas language also carries semantic information that should be taken into account. Here we propose a metric that exploits Zipf's and Heaps' power laws to represent semantic information in terms of signal information and that estimates the amount of information anticipated from a corpus of a given length in words. The proposed metric is tested on sub-corpora of 20 different lengths drawn from a major Turkish corpus (METU). While the entropy changed with the length of the corpus, the value of the proposed metric stayed almost constant, which supports our claim about corpus normalization. en_US
dc.identifier.doi 10.1109/SIU.2013.6531217
dc.identifier.isbn 978-1-4673-5563-6
dc.identifier.isbn 978-1-4673-5562-9
dc.identifier.issn 2165-0608
dc.identifier.scopus 2-s2.0-84880873119
dc.identifier.uri https://hdl.handle.net/20.500.14365/2815
dc.language.iso tr en_US
dc.publisher IEEE en_US
dc.relation.ispartof 2013 21st Signal Processing and Communications Applications Conference (SIU) en_US
dc.rights info:eu-repo/semantics/closedAccess en_US
dc.subject language model performance en_US
dc.subject corpus comparison en_US
dc.subject cross entropy en_US
dc.title A Proposal for Corpus Normalization en_US
dc.type Conference Object en_US
dspace.entity.type Publication
gdc.author.id Dinçer, Bekir Taner/0000-0002-0660-7239
gdc.author.wosid Dinçer, Bekir Taner/AAU-7709-2020
gdc.bip.impulseclass C5
gdc.bip.influenceclass C5
gdc.bip.popularityclass C5
gdc.coar.access metadata only access
gdc.coar.type text::conference output
gdc.collaboration.industrial false
gdc.description.department İEÜ, Faculty of Engineering, Department of Software Engineering en_US
gdc.description.departmenttemp [Karaoglan, Bahar] Ege Univ, Uluslararasi Bilgisayar Enstitusu, Izmir, Turkey; [Dincer, Bekir Taner] Mugla Univ, Enformat Bolumu, Mugla, Turkey; [Kisla, Tarik] Ege Univ, Bilgisayar & Ogretim Teknolojileri Egitimi Bolum, Izmir, Turkey; [Metin, Senem Kumova] Izmir Ekonomi Univ, Yazilim Muhendisligi Bolumu, Izmir, Turkey en_US
gdc.description.endpage 4
gdc.description.publicationcategory Conference Item - International - Institutional Faculty Member en_US
gdc.description.scopusquality N/A
gdc.description.startpage 1
gdc.description.wosquality N/A
gdc.identifier.openalex W2022745987
gdc.identifier.wos WOS:000325005300058
gdc.index.type WoS
gdc.index.type Scopus
gdc.oaire.diamondjournal false
gdc.oaire.impulse 0.0
gdc.oaire.influence 2.5349236E-9
gdc.oaire.isgreen false
gdc.oaire.keywords corpus comparison
gdc.oaire.keywords language model performance
gdc.oaire.keywords cross entropy
gdc.oaire.popularity 6.0129673E-10
gdc.oaire.publicfunded false
gdc.openalex.collaboration National
gdc.openalex.fwci 0.0
gdc.openalex.normalizedpercentile 0.08
gdc.opencitations.count 0
gdc.plumx.mendeley 4
gdc.plumx.scopuscites 0
gdc.scopus.citedcount 0
gdc.virtual.author Kumova Metin, Senem
gdc.wos.citedcount 0
relation.isAuthorOfPublication 81d6fcea-c590-42aa-8443-7459c9eab7fa
relation.isAuthorOfPublication.latestForDiscovery 81d6fcea-c590-42aa-8443-7459c9eab7fa
relation.isOrgUnitOfPublication 805c60d5-b806-4645-8214-dd40524c388f
relation.isOrgUnitOfPublication 26a7372c-1a5e-42d9-90b6-a3f7d14cad44
relation.isOrgUnitOfPublication e9e77e3e-bc94-40a7-9b24-b807b2cd0319
relation.isOrgUnitOfPublication.latestForDiscovery 805c60d5-b806-4645-8214-dd40524c388f

Files

Original bundle

Name: 2815.pdf
Size: 323.29 KB
Format: Adobe Portable Document Format