Quantifying the Systematic Bias in the Accessibility and Inaccessibility of Web Scraping Content from URL-Logged Web-Browsing Digital Trace Data

Authors: Ross Dahlke, Deepak Kumar, Zakir Durumeric, Jeffrey Hancock
Publication year: 2023
DOI: 10.31219/osf.io/bkpqt
Description: Social scientists and computer scientists are increasingly using observational digital trace data and analyzing these data post hoc to understand the content people are exposed to online. However, these content collection efforts may be systematically biased when the entirety of the data cannot be captured retroactively. We call this unstated assumption the problematic assumption of persistence. To examine the extent to which this assumption may exist, we examine over 21 million URL-logged web browser visits from 1,515 participants over four months and record the degree to which hard news and misinformation URLs individuals visited were persistent, inaccessible, or ephemeral. While we find that the URLs collected are largely persistent, we find there are systematic biases in which URLs are ephemeral and inaccessible. For example, conservative misinformation URLs are more likely to be ephemeral than other types of misinformation. To standardize the reporting and understanding of the problematic assumption of persistence, we offer a set of metrics, PersistenceRate, InaccessibilityRate, EphemeralityRate (PIE metrics), that future research should report when using digital trace and web scraping data.
Database: OpenAIRE