MementoMap Framework for Flexible and Adaptive Web Archive Profiling
Authors: | Sawood Alam, Michael L. Nelson, Fernando Melo, Daniel Gomes, Michele C. Weigle, Daniel Bicho |
---|---|
Year of publication: | 2019 |
Subject: |
FOS: Computer and information sciences
Binary search algorithm; Information retrieval; Computer science; Web archiving; 05 social sciences; Listing (computer); Computer Science - Digital Libraries; 02 engineering and technology; computer.file_format; File format; computer.software_genre; WAR; News aggregator; Index (publishing); 020204 information systems; Web page; 0202 electrical engineering, electronic engineering, information engineering; Digital Libraries (cs.DL); 0509 other social sciences; 050904 information & library sciences; computer |
Source: | JCDL |
DOI: | 10.48550/arxiv.1905.12607 |
Description: | In this work we propose MementoMap, a flexible and adaptive framework to efficiently summarize the holdings of a web archive. We describe a simple, yet extensible, file format suitable for MementoMap. We used the complete index of Arquivo.pt, comprising 5B mementos (archived web pages/files), to understand the nature and shape of its holdings. We generated MementoMaps with varying amounts of detail from its HTML pages that have an HTTP status code of 200 OK. Additionally, we designed a single-pass, memory-efficient, and parallelization-friendly algorithm to compact a large MementoMap into a small one, and an in-file binary search method for efficient lookup. We analyzed more than three years of MemGator (a Memento aggregator) logs to understand the response behavior of 14 public web archives. We evaluated MementoMaps by measuring their Accuracy using 3.3M unique URIs from MemGator logs. We found that a MementoMap of less than 1.5% Relative Cost (as compared to the comprehensive listing of all the unique original URIs) can correctly identify the presence or absence of 60% of the lookup URIs in the corresponding archive while maintaining 100% Recall (i.e., zero false negatives). Comment: In Proceedings of JCDL 2019; 13 pages, 9 tables, 13 figures, 3 code samples, and 1 equation |
Database: | OpenAIRE |
External link: |