Learning Robust Dense Retrieval Models from Incomplete Relevance Labels
Authors: | Prafull Prakash, Hamed Zamani, Julian Killingback |
---|---|
Year of publication: | 2021 |
Subject: | Matching (statistics); Computer science; Computer Science::Information Retrieval; Deep learning; Sampling (statistics); Machine learning; Ranking (information retrieval); Set (abstract data type); Sampling distribution; Search algorithm; Relevance (information retrieval); Artificial intelligence |
Source: | SIGIR |
DOI: | 10.1145/3404835.3463106 |
Description: | Recent deployment of efficient billion-scale approximate nearest neighbor (ANN) search algorithms on GPUs has motivated information retrieval researchers to develop neural ranking models that learn low-dimensional dense representations for queries and documents and use ANN search for retrieval. However, optimizing these dense retrieval models poses several challenges, including negative sampling for (pair-wise) training. A recent model, called ANCE, successfully uses dynamic negative sampling based on ANN search. This paper improves upon ANCE by proposing a robust negative sampling strategy for scenarios where the training data lacks complete relevance annotations. This is of particular importance as obtaining large-scale training data with complete relevance judgments is extremely expensive. Our model uses a small validation set with complete relevance judgments to accurately estimate a negative sampling distribution for dense retrieval models. We also explore leveraging a lexical matching signal during training and pseudo-relevance feedback during evaluation for improved performance. Our experiments on the TREC Deep Learning Track benchmarks demonstrate the effectiveness of our solutions. |
Database: | OpenAIRE |