A Proposal for Corpus Normalization
Date
2013
Publisher
IEEE
Green Open Access
No
Publicly Funded
No
Abstract
To compare work in natural language processing, the corpora used in different studies should be standardized or normalized. Entropy, when used as a language model performance metric, depends entirely on signal information; for language, however, semantic information should also be taken into account. Here we propose a metric that exploits Zipf's and Heaps' power laws to represent semantic information in terms of signal information and to estimate the amount of information anticipated from a corpus of a given length in words. The proposed metric is tested on 20 sub-corpora of different lengths drawn from a major Turkish corpus (METU). While the entropy changed with the length of the corpus, the value of our proposed metric stayed almost constant, which supports our claim that it normalizes across corpora.
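The abstract does not give the metric's exact formula, so the following is only a minimal sketch of the measurements it builds on: unigram entropy (the "signal information") and a Heaps'-law fit (vocabulary growth as a rough proxy for semantic information), computed on sub-corpora of increasing length. All function names and the input file are hypothetical, not from the paper.

```python
# Illustrative sketch only: shows how entropy drifts with corpus length
# while a Heaps'-law exponent summarizes vocabulary growth. The paper's
# actual normalization metric is not reproduced here.
import math
from collections import Counter

def unigram_entropy(tokens):
    """Shannon entropy (bits/word) of the unigram distribution --
    the signal information that varies with corpus length."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def heaps_exponent(tokens, step=1000):
    """Fit Heaps' law V(n) ~ K * n**beta by least squares in log space;
    beta describes how fast new vocabulary appears as the corpus grows."""
    xs, ys = [], []
    seen = set()
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            xs.append(math.log(i))
            ys.append(math.log(len(seen)))
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Compare sub-corpora of different lengths, as the paper does with 20
# sub-corpora of METU. "metu_sample.txt" is a hypothetical file name.
corpus = open("metu_sample.txt", encoding="utf-8").read().split()
for n in (10_000, 50_000, 100_000):
    sub = corpus[:n]
    print(f"n={n}: entropy={unigram_entropy(sub):.3f} bits/word, "
          f"heaps_beta={heaps_exponent(sub):.3f}")
```

Finite-sample unigram entropy keeps rising as new word types appear, which is why a raw entropy figure is not comparable across corpora of different sizes; a metric that folds in the Zipf/Heaps growth rate can stay length-invariant.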
Description
21st Signal Processing and Communications Applications Conference (SIU) -- APR 24-26, 2013 -- CYPRUS
Keywords
language model performance, corpus comparison, cross entropy
WoS Q
N/A
Scopus Q
N/A

OpenCitations Citation Count
N/A
Source
2013 21st Signal Processing and Communications Applications Conference (SIU)
Start Page
1
End Page
4
PlumX Metrics
Citations
Scopus: 0
Captures
Mendeley Readers: 4