Türkçe istenmeyen e-postaların derin öğrenme ile tespit edilmesi [Detection of Turkish spam e-mails with deep learning]

Author:	Eryılmaz, Ersin Enes
Contributors:	Kılıç, Erdal; OMÜ, Institute of Graduate Studies, Department of Computer Engineering
Language:	Turkish
Publication year:	2021
Subject:	Google Colaboratory; GRU; hyperparameter fine-tuning; deep learning; machine learning; RNN; BLSTM; DistilBERT; spam detection; LSTM; BERT; Keras
Description:	Full Text / Thesis. E-mail is one of today's most effective communication tools. Alongside legitimate messages, e-mail traffic also carries spam. Spam, meaning junk or unwanted e-mail, causes serious material and moral harm to internet users and also consumes internet traffic. Although many methods exist for detecting spam, current solutions often lag behind the inventiveness and techniques of spammers. This thesis reviews the spam detection methods found in the literature and proposes 6 different models for detecting Turkish spam e-mails. Four deep learning models, RNN, LSTM, GRU and BLSTM, were developed in the Spyder development environment using the Python programming language and the Keras library. Two further deep learning models, BERT and DistilBERT, together with hyperparameter fine-tuning and selection of the best hyperparameters, were developed and tested in the web-based Google Colaboratory, where the TensorFlow-based Keras library was also used. The proposed models were developed on a data set of 800 Turkish e-mails, 400 spam and 400 legitimate. With 5-fold cross-validation, BLSTM had the lowest test loss, 0.0373, and both LSTM and BLSTM reached a success rate of 99.38% in spam detection. The fine-tuned BERT model reached a success rate of 98.75%. Hyperparameter fine-tuning for the RNN model was carried out with the Grid Search estimator and yielded a success rate of 97.66%. In addition, a new Turkish e-mail data set containing 350 e-mails was created as part of the thesis.
In future studies, the size of this e-mail data set is planned to be increased and further experiments with deep learning models carried out.
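The 5-fold cross-validation mentioned in the description can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function name stratified_k_fold, the random seed and the choice of a stratified (class-balanced) split are assumptions made for illustration, and the labels are placeholders that only mirror the stated 400 spam / 400 legitimate balance.

```python
import random


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs, keeping the class balance in each fold.

    Hypothetical helper for illustration; the thesis does not state
    whether its folds were stratified.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Shuffle each class and deal its indices round-robin into k folds.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    # Each fold serves once as the test set; the other k-1 folds form the training set.
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Placeholder labels: 1 = spam, 0 = legitimate (400 each, as in the description).
labels = [1] * 400 + [0] * 400
splits = list(stratified_k_fold(labels, k=5))

for fold_no, (train_idx, test_idx) in enumerate(splits, start=1):
    spam_in_test = sum(labels[i] for i in test_idx)
    print(fold_no, len(train_idx), len(test_idx), spam_in_test)
```

Under these assumptions every fold trains on 640 e-mails and tests on 160, of which 80 are spam. In practice a model such as the LSTM or BLSTM named in the description would be trained and evaluated once per fold, and the per-fold results averaged.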
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=od______9773::6ec2a7c768d089d49857d5b17853937a http://libra.omu.edu.tr/tezler/135964.pdf