Identification of Multiword Expressions in Turkish Based on Web Data

Uymaz, Hande Aka

Please use this identifier to cite or link to this item: https://hdl.handle.net/20.500.14365/34

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Metin, Senem Kumova	-
dc.contributor.author	Uymaz, Hande Aka	-
dc.date.accessioned	2023-06-16T12:27:30Z	-
dc.date.available	2023-06-16T12:27:30Z	-
dc.date.issued	2016	-
dc.identifier.uri	https://tez.yok.gov.tr/UlusalTezMerkezi/TezGoster?key=cbOXH84ZayrLjc0tI-QXKqX_pVqTMAxcFGcOTPodUi9B1bFq-QhBSpUTVju6wdJk	-
dc.identifier.uri	https://hdl.handle.net/20.500.14365/34	-
dc.description.abstract	Çok sözcüklü ifade, doğal dillerde, sözcüklerin anlam bütünlüğü oluşturmak üzere tekrarlayan kombinasyonlarıdır. Metinlerden çok sözcüklü ifadelerin belirlenmesi bir çok doğal dil işleme uygulamaları ( Doğal dil üretme, hesaplamalı sözlükbilim, makine çevirileri vb.) için çok önemli bir konudur. çok sözcüklü ifadelerin belirlenmesi için gözlenme sıklığı bağımlı yöntemler ( Bileşik olasılık (joint probability), noktasal karşılıklı bilgi katsayısı (pointwise mutual information), karşılıklı bağlılık (mutual dependency) v.b) sıklıkla kullanılır. Bu yöntemlerin en büyük dezavantajı, çok sözcüklü ifadelerin belirlenmesinin performansının frekansın ölçüldüğü veri kaynağının büyüklüğüne bağlı olmasıdır. Bu tezin amacı, küçük veri setlerinin yarattığı problemlerin önüne geçmek için bilinen en büyük veri kaynağı olan web'i kullanarak gözlenme sıklığını elde etmektir. Bu tezde, 2 farklı aday veri seti kullanılarak, Türkçe dili için frekans tabanlı çok sözcüklü ifade belirleme metotlarının performansı araştırılmıştır. Veri setlerindeki adayların gözlenme sıklığı bilgisi popüler bir arama motoru olan Google kullanılarak elde edilmiştir. Aday çok sözcüklü ifadelerin arama motoruna sorgu olarak gönderildiğinde alınan sayfa sayısı (ing. page count) adayın gözlenme sıklığı olarak kabul edilmiştir. Kullanılan 20 yöntemin başarısı anma(recall), duyarlılık(precision) ve F-ölçütü (F-measure) ile değerlendirilmiştir. Web tabanlı frekans bilgisinin çok sözcüklü ifadelerin belirlenmesindeki performansı geleneksel derlem tabanlı frekans ile karşılaştırılmıştır ve çok sözcüklü ifadelerin belirlenmesinde web verilerinin kullanılması umut verici sonuçlar göstermiştir.	en_US
dc.description.abstract	Multiword expressions (MWEs) are recurrent combinations of words in natural languages. Extracting MWEs from text is important for a number of natural language processing applications (e.g. natural language generation, computational lexicography, and machine translation). Various occurrence-frequency-based methods (e.g. joint probability, pointwise mutual information, and mutual dependency) are frequently used for MWE extraction ([12],[13]). The major disadvantage of these methods is that extraction performance depends mainly on the size of the data set in which the occurrence frequency is measured. The main goal of this thesis is to obtain the frequencies from a massive data source, the World Wide Web, in order to bypass the negative effect of small data sets. In this thesis, we applied frequency-based MWE extraction methods to two Turkish MWE data sets. The occurrence frequencies of the MWE candidates in the data sets are obtained from the popular search engine Google: the page count retrieved when a candidate is sent as a query to Google is used as its occurrence frequency. The 20 frequency-based methods are evaluated by precision, recall, and F-measure. The performance of web-based frequencies in the identification of MWEs is compared with that of traditional corpus-based frequencies, and it is shown that the use of web data in the identification of MWEs yields promising results.	en_US
dc.language.iso	en	en_US
dc.publisher	İzmir Ekonomi Üniversitesi	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	çok sözcüklü ifade	en_US
dc.subject	sıklık tabanlı yöntemler	en_US
dc.subject	web verisi	en_US
dc.subject	Multiword expression	en_US
dc.subject	frequency based methods	en_US
dc.subject	web data	en_US
dc.subject	Bilgisayar Mühendisliği Bilimleri-Bilgisayar ve Kontrol	en_US
dc.subject	Computer Engineering and Computer Science and Control	en_US
dc.subject	Bilgisayarlı dil bilim	en_US
dc.subject	Computerized linguistics	en_US
dc.subject	Derlem dil bilim	en_US
dc.subject	Corpus linguistics	en_US
dc.subject	Doğal dil	en_US
dc.subject	Natural language	en_US
dc.subject	Metin dil bilim	en_US
dc.subject	Text linguistics	en_US
dc.title	Identification of Multiword Expressions in Turkish Based on Web Data	en_US
dc.title.alternative	Web Verisi Kullanılarak Türkçe Çok Sözcüklü İfadelerin Belirlenmesi	en_US
dc.type	Master Thesis	en_US
dc.department	İEÜ, Lisansüstü Eğitim Enstitüsü, Bilgisayar Mühendisliği Ana Bilim Dalı	en_US
dc.identifier.startpage	1	en_US
dc.identifier.endpage	55	en_US
dc.institutionauthor	Uymaz, Hande Aka	-
dc.relation.publicationcategory	Tez	en_US
dc.identifier.yoktezid	434360	en_US
dc.identifier.scopusquality	N/A	-
dc.identifier.wosquality	N/A	-
item.openairecristype	http://purl.org/coar/resource_type/c_18cf	-
item.cerifentitytype	Publications	-
item.grantfulltext	open	-
item.fulltext	With Fulltext	-
item.openairetype	Master Thesis	-
item.languageiso639-1	en	-
crisitem.author.dept	05.04. Software Engineering	-
Appears in Collections:	Lisansüstü Eğitim Enstitüsü Tez Koleksiyonu
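
The abstract names pointwise mutual information as one of the frequency-based measures and precision, recall, and F-measure as the evaluation criteria. The following is a minimal Python sketch of those standard formulas, not the thesis's implementation. The function names, the example counts, and the normalizing constant `total` (a stand-in for the unknown number of indexed pages) are all assumptions made for illustration.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information of a two-word candidate.

    count_xy: count for the whole candidate (e.g. a page count)
    count_x, count_y: counts for each word on its own
    total: assumed normalizing constant (hypothetical; the true size of a
           search engine's index is not known)
    """
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


def precision_recall_f(predicted, gold):
    """Precision, recall, and F-measure of a predicted MWE set against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure


# Made-up counts, for illustration only.
score = pmi(count_xy=1_000, count_x=10_000, count_y=20_000, total=1_000_000)

# Made-up candidate labels, for illustration only.
p, r, f = precision_recall_f({"a", "b", "c"}, {"b", "c", "d", "e"})
```

In a setup like the one the abstract describes, candidates would be ranked by such a score and the top-ranked ones compared against the labeled data set; how the thesis thresholds or ranks candidates is not stated in this record.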

Files in This Item:

File	Size	Format
2506.pdf	3.56 MB	Adobe PDF


Page view(s): 276 (checked on Sep 8, 2025)

Download(s): 44 (checked on Sep 8, 2025)
