The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing

Autor: David Zeber, Camila Oliveira, Martin Lopatka, Fredrik Wollsén, Ilana Segall, Walter Rudametkin, Sarah Bird
Přispěvatelé: Mozilla, Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), Self-adaptation for distributed services and large software systems (SPIRALS), Inria Lille - Nord Europe, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS)-Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), ANR-19-CE39-0002,FP-Locker,FP-Locker : Renforcer l'authentification web grâce aux empreintes de navigateur(2019)
Jazyk: angličtina
Rok vydání: 2020
Předmět:
Zdroj: The Web Conference
The Web Conference, Apr 2020, Taipei, Taiwan. ⟨10.1145/3366423.3380104⟩
WWW
DOI: 10.1145/3366423.3380104⟩
Popis: International audience; Large-scale Web crawls have emerged as the state of the art for studying characteristics of the Web. In particular, they are a core tool for online tracking research. Web crawling is an attractive approach to data collection, as crawls can be run at relatively low infrastructure cost and don't require handling sensitive user data such as browsing histories. However, the biases introduced by using crawls as a proxy for human browsing data have not been well studied. Crawls may fail to capture the diversity of user environments , and the snapshot view of the Web presented by one-time crawls does not reflect its constantly evolving nature, which hinders reproducibility of crawl-based studies. In this paper, we quantify the repeatability and representativeness of Web crawls in terms of common tracking and fingerprinting metrics, considering both variation across crawls and divergence from human browser usage. We quantify baseline variation of simultaneous crawls, then isolate the effects of time, cloud IP address vs. residential, and operating system. This provides a foundation to assess the agreement between crawls visiting a standard list of high-traffic websites and actual browsing behaviour measured from an opt-in sample of over 50,000 users of the Firefox Web browser. Our analysis reveals differences between the treatment of stateless crawling infrastructure and generally stateful human browsing, showing, for example, that crawlers tend to experience higher rates of third-party activity than human browser users on loading pages from the same domains.
Databáze: OpenAIRE