Popis: |
Given a million escort advertisements, how can we spot near-duplicates? Such micro-clusters of ads are usually signals of human trafficking. How can we summarize them, visually, to convince law enforcement to act? Can we build a general tool that works for different languages? Spotting micro-clusters of near-duplicate documents is also useful in several other settings, including spam-bot detection in Twitter ads, plagiarism, and more. We present INFOSHIELD, which makes the following contributions: (a) Practical, being scalable and effective on real data; (b) Parameter-free and Principled, requiring no user-defined parameters; (c) Interpretable, finding a document to be the cluster representative, highlighting all the common phrases, and automatically detecting "slots", i.e., phrases that differ in every document; and (d) Generalizable, beating or matching domain-specific methods in Twitter bot detection and human-trafficking detection, respectively, as well as being language-independent, finding clusters in Spanish, Italian, and Japanese. Interpretability is particularly important for the anti-human-trafficking domain, where law enforcement must visually inspect ads. Our experiments on real data show that INFOSHIELD correctly identifies Twitter bots with an F1 score over 90% and detects human-trafficking ads with 84% precision. Moreover, it is scalable, requiring about 8 hours for 4 million documents on a stock laptop. |