A Proposal for Corpus Normalization


Date

2013


Publisher

IEEE

Open Access Color

Green Open Access


Publicly Funded

No


Abstract

To compare work done in natural language processing, the corpora used in different studies should be standardized (normalized). Entropy, when used as a language-model performance metric, depends entirely on signal information; when language is considered, however, semantic information should be taken into account as well. We propose a metric that exploits Zipf's and Heaps' power laws to represent semantic information in terms of signal information and to estimate the amount of information anticipated from a corpus of a given length in words. The proposed metric was tested on 20 sub-corpora of different lengths drawn from a major Turkish corpus (METU). While entropy varied with corpus length, the value of our proposed metric remained almost constant, which supports our claim about normalizing the corpus.
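The paper's exact formula is not given in this record, so the following is only a minimal sketch of the ingredients the abstract names: Heaps' law for vocabulary growth and a Zipf rank-frequency model, combined to estimate the information anticipated from a corpus of N words. All parameter names and values (K, beta, s) are illustrative assumptions, not the authors' notation.

```python
import math

def heaps_vocab(n_tokens: int, K: float = 30.0, beta: float = 0.55) -> int:
    """Expected vocabulary size under Heaps' law: V(N) = K * N^beta."""
    return max(1, int(K * n_tokens ** beta))

def zipf_entropy(vocab_size: int, s: float = 1.0) -> float:
    """Shannon entropy (bits/word) of a Zipf distribution p(r) proportional
    to 1/r^s, truncated at rank vocab_size."""
    weights = [1.0 / r ** s for r in range(1, vocab_size + 1)]
    z = sum(weights)
    return -sum((w / z) * math.log2(w / z) for w in weights)

def expected_information(n_tokens: int) -> float:
    """Estimated total information (bits) anticipated from a corpus of
    n_tokens words, under the assumed Heaps/Zipf parameters above."""
    return n_tokens * zipf_entropy(heaps_vocab(n_tokens))
```

A length-normalized comparison would then divide a corpus's measured information by `expected_information(n_tokens)`, so that sub-corpora of different lengths become comparable, which is the behavior the abstract reports for the proposed metric.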

Description

21st Signal Processing and Communications Applications Conference (SIU) -- APR 24-26, 2013 -- CYPRUS

Keywords

language model performance, corpus comparison, cross entropy



Source

2013 21st Signal Processing and Communications Applications Conference (SIU)

Start Page

1

End Page

4
PlumX Metrics
Citations

Scopus : 0

Captures

Mendeley Readers : 4

OpenAlex FWCI
0.0
