INFOSHIELD: Generalizable Information-Theoretic Human-Trafficking Detection

Authors:	Catalina Vajiac, Sacha Levy, Meng-Chieh Lee, Reihaneh Rabbany, Christos Faloutsos, Namyong Park, Aayushi Kulshrestha, Cara Jones
Publication year:	2021
Subject:	Matching (statistics); Information retrieval; business.product_category; Computer science; Laptop; Scalability; Law enforcement; Spotting; business; F1 score; Domain (software engineering); Interpretability
Source:	ICDE
DOI:	10.1109/icde51399.2021.00101
Description:	Given a million escort advertisements, how can we spot near-duplicates? Such micro-clusters of ads are usually signals of human trafficking. How can we summarize them, visually, to convince law enforcement to act? Can we build a general tool that works for different languages? Spotting micro-clusters of near-duplicate documents is useful in multiple additional settings, including spam-bot detection in Twitter ads, plagiarism, and more. We present INFOSHIELD, which makes the following contributions: (a) Practical, being scalable and effective on real data; (b) Parameter-free and Principled, requiring no user-defined parameters; (c) Interpretable, finding a document to be the cluster representative, highlighting all the common phrases, and automatically detecting "slots", i.e., phrases that differ in every document; and (d) Generalizable, beating or matching domain-specific methods in Twitter bot detection and human trafficking detection respectively, as well as being language-independent, finding clusters in Spanish, Italian, and Japanese. Interpretability is particularly important for the anti-human-trafficking domain, where law enforcement must visually inspect ads. Our experiments on real data show that INFOSHIELD correctly identifies Twitter bots with an F1 score over 90% and detects human-trafficking ads with 84% precision. Moreover, it is scalable, requiring about 8 hours for 4 million documents on a stock laptop.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::77fffd61a86a794ee62c7dc0d1fba1ac https://doi.org/10.1109/icde51399.2021.00101