Classifying and Understanding Tor Traffic Using Tree-Based Models

Author:	Adrian Lara, Jose Guevara-Coto, Paulo Calvo
Year of publication:	2020
Subject:	business.industry; Computer science; Network packet; ComputerSystemsOrganization_COMPUTER-COMMUNICATIONNETWORKS; Decision tree; 020206 networking & telecommunications; 02 engineering and technology; Infrastructure security; computer.software_genre; Security controls; Random forest; 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; The Internet; Data mining; business; TCP window scale option; computer; Anonymity
Source:	LATINCOM
DOI:	10.1109/latincom50620.2020.9282317
Description:	Over the past years, the use of anonymization services has gained significant relevance as more users seek to protect their data and privacy on the internet. One of the most popular ways to achieve this is Tor. The anonymity and untraceability that Tor provides, however, can also be exploited by ill-intentioned users to bypass security controls and policies. The Cybersecurity and Infrastructure Security Agency (CISA) mentions two methods of recognizing Tor traffic in the enterprise: indicator-based and behavior-based analysis. The former uses log analysis and lists of Tor exit nodes to identify suspicious activity, while the latter inspects patterns in TCP and UDP ports, DNS queries, and packet payloads. In this paper, we propose a different approach using white-box machine learning models such as decision trees and Random Forest. On the one hand, our classifier achieves accuracy levels above 95%. On the other hand, our approach is the first one to allow understanding the importance of each traffic feature in the classification. Our results demonstrate that the TCP window size, the frame size, and time-related traffic features can be used to identify Tor traffic. In this paper we describe a machine learning methodology for identifying Tor network traffic using C5.0 decision trees and Random Forest. We followed a white-box approach and achieved a prediction accuracy of over 95% with both models. We also present an analysis of the importance of the top predictor variables.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::f5a61ab80d407bcffa8ea2c0bcd365af https://doi.org/10.1109/latincom50620.2020.9282317
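
Note:	The abstract's claim that a white-box tree model exposes the importance of each traffic feature can be illustrated with a minimal sketch. This is not the authors' code or data: the paper uses C5.0 decision trees and Random Forest, while the sketch below swaps in a single Gini-impurity split per feature (a decision stump) in plain Python. The feature names, the toy flows, and the labels are all made up for illustration, and the toy values are deliberately chosen so that the window-size feature separates the two classes.

```python
from typing import List, Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of binary labels (1 = Tor, 0 = non-Tor)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split(values: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (threshold, impurity_decrease) for the best single split on one feature."""
    n = len(labels)
    parent = gini(labels)
    best_thr, best_gain = float("nan"), 0.0
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        thr = (lo + hi) / 2.0
        left = [y for x, y in zip(values, labels) if x <= thr]
        right = [y for x, y in zip(values, labels) if x > thr]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - weighted
        if gain > best_gain:
            best_thr, best_gain = thr, gain
    return best_thr, best_gain


def rank_features(
    rows: Sequence[Sequence[float]],
    labels: Sequence[int],
    names: Sequence[str],
) -> List[Tuple[str, float, float]]:
    """Rank features by the impurity decrease of their best single split."""
    ranked = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        thr, gain = best_split(column, labels)
        ranked.append((name, thr, gain))
    return sorted(ranked, key=lambda item: item[2], reverse=True)


# Hypothetical feature names and made-up toy flows (NOT data from the paper).
FEATURES = ["tcp_window_size", "frame_size", "inter_arrival_ms"]
ROWS = [
    (1024, 590, 12.0),
    (1024, 1500, 3.0),
    (1024, 590, 15.0),
    (1024, 600, 4.0),
    (65535, 1500, 2.0),
    (64240, 590, 14.0),
    (65535, 400, 3.5),
    (64240, 1500, 13.0),
]
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = Tor, 0 = non-Tor (assumed labels)

RANKING = rank_features(ROWS, LABELS, FEATURES)
for name, thr, gain in RANKING:
    print(f"{name}: split at {thr:.1f}, impurity decrease {gain:.3f}")
```

A full tree or forest repeats this kind of split recursively and accumulates the impurity decrease per feature, and that accumulated value is typically what gets reported as feature importance. The ranking printed here reflects only the toy data and says nothing about the paper's results.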