Some Issues in Large Scale Data Gathering

Author: Sung-ting Tsai, 蔡松廷
Publication year: 2002
Document type: thesis
Description: 90
The World Wide Web holds an extremely large amount of data, and it grows rapidly every second. A good search engine needs an efficient and reliable crawling system to gather this data, and such a system is not easy to build. The research in this thesis is based on the gais robot project created by Cauchy Lin [1]. While the program was gathering data, and afterwards when we examined the results, we found many problems. For example, the crawler gets trapped in some web sites and fetches a lot of useless pages. To solve these problems and make crawling more efficient, we analyze web sites and URLs according to their behavior. First, we build a site database (SiteDB) that stores and counts per-site information and supports fast query and analysis. Based on SiteDB, we define several site types. Second, we run experiments on URLs, grouped by their characteristics, to gather statistics. Finally, we derive some effective rules and suggestions. As further issues, we discuss the problems that arise when the program parses robots.txt and frame pages.
Database: Networked Digital Library of Theses & Dissertations
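
The abstract describes keeping per-site counts in a SiteDB and using them to avoid crawler traps. The following is only a minimal sketch of that idea, not the thesis implementation: the class name `SiteDB` is taken from the abstract, but the in-memory counter, the method names `record` and `should_fetch`, and the thresholds `max_pages_per_site` and `max_path_depth` are all hypothetical choices made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


class SiteDB:
    """Hypothetical in-memory stand-in for a per-site statistics store."""

    def __init__(self, max_pages_per_site=1000, max_path_depth=8):
        # Both limits are assumed values, not figures from the thesis.
        self.max_pages_per_site = max_pages_per_site
        self.max_path_depth = max_path_depth
        self.page_counts = Counter()  # host -> number of pages fetched

    def record(self, url):
        """Count one fetched page against the URL's host."""
        host = urlsplit(url).netloc.lower()
        self.page_counts[host] += 1

    def should_fetch(self, url):
        """Return False for URLs that look like part of a crawler trap."""
        parts = urlsplit(url)
        host = parts.netloc.lower()
        # Very deep paths are a common symptom of endlessly generated links.
        depth = len([segment for segment in parts.path.split("/") if segment])
        if depth > self.max_path_depth:
            return False
        # Stop once a single site has consumed its page budget.
        return self.page_counts[host] < self.max_pages_per_site
```

In a crawler loop, one would presumably call `should_fetch(url)` before each download and `record(url)` after it, so that a site producing unbounded pages is cut off once it exceeds its budget.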