Online Algorithms for Estimating Change Rates of Web Pages

Autor: Gugan Thoppe, Konstantin Avrachenkov, Kishor Patil
Přispěvatelé: Network Engineering and Operations (NEO ), Inria Sophia Antipolis - Méditerranée (CRISAM), Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), Indian Institute of Science [Bangalore] (IISc Bangalore), Projet PIA - ANSWER - FSN2 (P159564-2661789\DOS0060094)
Rok vydání: 2020
Předmět:
FOS: Computer and information sciences
Computer Science - Machine Learning
Optimization problem
Theoretical computer science
Momentum
Computer Networks and Communications
Computer science
Stochastic approximation
0211 other engineering and technologies
Database synchronization
Machine Learning (stat.ML)
02 engineering and technology
Change rate
[INFO.INFO-SI]Computer Science [cs]/Social and Information Networks [cs.SI]
Computer Science - Information Retrieval
Machine Learning (cs.LG)
[INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG]
[STAT.ML]Statistics [stat]/Machine Learning [stat.ML]
Statistics - Machine Learning
020204 information systems
Convergence (routing)
Synchronization (computer science)
Web page
0202 electrical engineering
electronic engineering
information engineering

FOS: Mathematics
Online algorithm
Heavy ball
Social and Information Networks (cs.SI)
021103 operations research
Probability (math.PR)
Web crawling
Computer Science - Social and Information Networks
[MATH.MATH-PR]Mathematics [math]/Probability [math.PR]
Hardware and Architecture
Modeling and Simulation
[INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR]
Cache
Web crawler
Software
Mathematics - Probability
Information Retrieval (cs.IR)
Zdroj: Performance Evaluation
Performance Evaluation, 2022, SI: ValueTools 2020, 153, pp.25. ⟨10.1016/j.peva.2021.102261⟩
Performance Evaluation, Elsevier, 2021, SI: ValueTools 2020, 153, ⟨10.1016/j.peva.2021.102261⟩
ISSN: 0166-5316
DOI: 10.48550/arxiv.2009.08142
Popis: A search engine maintains local copies of different web pages to provide quick search results. This local cache is kept up-to-date by a web crawler that frequently visits these different pages to track changes in them. Ideally, the local copy should be updated as soon as a page changes on the web. However, finite bandwidth availability and server restrictions limit how frequently different pages can be crawled. This brings forth the following optimization problem: maximize the freshness of the local cache subject to the crawling frequencies being within prescribed bounds. While tractable algorithms do exist to solve this problem, these either assume the knowledge of exact page change rates or use inefficient methods such as MLE for estimating the same. We address this issue here. We provide three novel schemes for online estimation of page change rates, all of which have extremely low running times per iteration. The first is based on the law of large numbers and the second on stochastic approximation. The third is an extension of the second and includes a heavy-ball momentum term. All these schemes only need partial information about the page change process, i.e., they only need to know if the page has changed or not since the last crawled instance. Our main theoretical results concern asymptotic convergence and convergence rates of these three schemes. In fact, our work is the first to show convergence of the original stochastic heavy-ball method when neither the gradient nor the noise variance is uniformly bounded. We also provide some numerical experiments (based on real and synthetic data) to demonstrate the superiority of our proposed estimators over existing ones such as MLE. We emphasize that our algorithms are also readily applicable to the synchronization of databases and network inventory management.
Comment: This is the author version of the paper accepted to {\it International Journal of Performance Evaluation}, Elsevier; 25 pages. arXiv admin note: text overlap with arXiv:2004.02167
Databáze: OpenAIRE