Description: |
The explosive growth of the web has pushed many researchers working on web corpora toward distributed, data-parallel computing. The size of the ClueWeb09 [2] corpus, approximately one billion documents, illustrates this; even restricting the collection to English-language documents only halves its size. In this work, we describe the collection of information retrieval algorithms we have implemented using DryadLINQ [8]. DryadLINQ is a data-parallel processing system that lets programmers write distributed programs without worrying about the implementation of the underlying distributed system. It executes programs containing SQL-like Language Integrated Query (LINQ) statements by shipping the computation to nodes in the cluster for parallel execution. Because a computation can be broken into many pieces that are processed on individual machines, even a small number of computers can reduce the time needed to process large collections. When researchers first obtain a collection of web documents, a substantial amount of preprocessing is required before analysis can commence. The toolkit assists with parsing, link extraction, and associating discovered anchor text with the referenced document. Once the document content and links are in a standard format, further processing can be performed. The toolkit provides implementations of text-based retrieval methods (BM25 [7] and BM25F [9]), query-independent link-based scoring functions (PageRank, indegree, and trans-domain indegree), and a query-dependent link-based scoring function (SALSA-SETR [6]). Additionally, the toolkit provides an implementation of shingle-based duplicate document detection [1], n-gram extraction, and a mechanism to build an inverted index. The toolkit includes both traditional algorithms and recent research results.
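To make two of the components named above concrete, the following is a minimal single-machine sketch of building an inverted index and scoring documents against it with BM25. It is not the toolkit's DryadLINQ code: it is written in Python, and the function names (`build_inverted_index`, `bm25_scores`), the whitespace tokenizer, the parameter defaults `k1=1.2` and `b=0.75`, and the non-negative IDF variant are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to a postings dict {doc_id: term frequency}.

    `docs` is a hypothetical {doc_id: text} mapping; tokenization is a
    plain lowercase whitespace split (an assumption, not the toolkit's parser).
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def bm25_scores(query, docs, index, k1=1.2, b=0.75):
    """Return {doc_id: BM25 score} for documents matching any query term.

    k1 and b are commonly used defaults, assumed here rather than taken
    from the source.
    """
    n_docs = len(docs)
    doc_len = {doc_id: len(text.split()) for doc_id, text in docs.items()}
    avg_len = sum(doc_len.values()) / n_docs
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        df = len(postings)
        # Non-negative IDF variant (adds 1 inside the log); an assumption.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings.items():
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * doc_len[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return dict(scores)
```

In a data-parallel setting, the per-document term counting and the per-term grouping of postings would be the steps distributed across machines; the sketch keeps them in one process so the scoring logic stays easy to follow.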