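Kısa Metinleri Yazıldıkları Dile Göre Sınıflandırma ve Farklı Öznitelik Seçim Yöntemlerinin Uygulanması [Classification of Short Texts by the Language They Are Written In and Application of Different Feature Selection Methods]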

Author:	ASLANYÜREK, Murat; MESUT, Altan
Language:	Turkish
Year of publication:	2021
Subject:	Engineering, Multidisciplinary; Language recognition; Fasttext; Langdetect; Machine learning
Source:	Journal of Investigations on Engineering and Technology, Volume 4, Issue 2, pp. 36-46
ISSN:	2687-3052
Description:	In this study, language recognition was treated as a classification task on two groups of datasets of different sizes, both consisting of Wikipedia article abstracts. Dataset group A consists of article abstracts of 204 bytes or less, while dataset group B consists of abstracts of between 204 and 512 bytes. The first goal of the study is to determine the appropriate machine learning and feature selection methods according to the size of the short texts. The second goal is to identify the fastest and most accurate classification method. In the tests performed, the highest accuracy was achieved with SelectFromModel-Logistic Regression for feature selection, while among the machine learning methods, Multinomial Naive Bayes and Bernoulli Naive Bayes each outperformed the other depending on the text length of the dataset. In addition, the tests performed with all classification methods used in the study showed that, compared to the other methods, fasttext was superior in terms of accuracy and WBSM in terms of speed on both datasets.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=tubitakulakb::0d0067401d601b4613b6a32d00b2c801 https://dergipark.org.tr/tr/pub/jiet/issue/67435/1001758
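
The split into dataset groups A and B described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `assign_group`, the use of UTF-8 to measure size, and the treatment of exactly 512 bytes as part of group B are all assumptions made for illustration, since the record does not specify the encoding or how the boundaries are handled.

```python
from typing import Optional

# Thresholds taken from the abstract; the boundary handling is an assumption.
GROUP_A_MAX_BYTES = 204
GROUP_B_MAX_BYTES = 512


def assign_group(abstract: str, encoding: str = "utf-8") -> Optional[str]:
    """Return "A", "B", or None for an abstract, based on its encoded size.

    Hypothetical helper: UTF-8 is assumed because the record does not say
    which encoding was used to count bytes.
    """
    size = len(abstract.encode(encoding))
    if size <= GROUP_A_MAX_BYTES:
        return "A"  # 204 bytes or less
    if size <= GROUP_B_MAX_BYTES:
        return "B"  # more than 204 bytes, up to 512 (upper bound assumed inclusive)
    return None  # longer than either group covers


if __name__ == "__main__":
    samples = ["Kısa bir özet.", "x" * 300, "y" * 600]
    for text in samples:
        print(len(text.encode("utf-8")), assign_group(text))
```

Measuring size in bytes rather than characters matters for non-ASCII scripts: a Turkish letter such as "ş" occupies two bytes in UTF-8, so the same character count can land in different groups depending on the language.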