Popis: |
Web Usage Mining, also known as Web Log Mining, is the mining of data that results from user interaction with a Web server, including Web logs, click streams and database transactions, as well as from the visits of search engine crawlers to a Website. Log files provide an immense source of information about the behavior of users as well as search engine crawlers. Web Usage Mining concerns the discovery of common browsing patterns, i.e. pages requested in sequence, from Web logs. These patterns can be used to improve the design of a Website and guide its modification. Analyzing and discovering user behavior helps in understanding what online information users seek and how they behave. The results of the analysis can be used in intelligent online applications, in refining Websites, in improving search accuracy when seeking information, and in leading decision makers towards better decisions in changing markets, for instance by placing advertisements in ideal places. Similarly, crawlers or spiders access Websites to index new and updated pages, and the traces they leave help to analyze the behavior of search engine crawlers. The log files are unstructured and very large, so they need to be extracted and pre-processed before any data mining can follow. Pre-processing is done differently for each application. Two pre-processing algorithms are proposed, based on indiscernibility relations in rough set theory, which generate Equivalence Classes. The first algorithm generates a pre-processed file containing successful user requests, while the second generates a pre-processed file for pre-fetching and caching purposes. Two further algorithms are proposed to extract usage analytics. The first identifies the origin of visits, the top referring sites and the most popular keywords used by visitors to arrive at a Website. The second extracts user agents, such as the browsers and operating systems used by visitors to access a Website.
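As a rough illustration of the indiscernibility idea mentioned above, the following minimal sketch partitions log records into equivalence classes: two records fall into the same class when they agree on every chosen attribute. It is not the algorithm proposed in this work. The record fields (`ip`, `url`, `status`, `agent`), the sample data, the helper names and the reading of "successful" as an HTTP 2xx status are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical, already-parsed log records; field names and values are assumptions.
records = [
    {"ip": "10.0.0.1", "url": "/index.html", "status": 200, "agent": "Mozilla/5.0"},
    {"ip": "10.0.0.1", "url": "/about.html", "status": 200, "agent": "Mozilla/5.0"},
    {"ip": "10.0.0.2", "url": "/index.html", "status": 404, "agent": "Opera/9.80"},
    {"ip": "10.0.0.2", "url": "/news.html", "status": 200, "agent": "Opera/9.80"},
    {"ip": "10.0.0.3", "url": "/robots.txt", "status": 200, "agent": "Googlebot/2.1"},
]


def successful_requests(rows):
    """Keep only requests with a 2xx status (assumed meaning of 'successful')."""
    return [row for row in rows if 200 <= row["status"] < 300]


def equivalence_classes(rows, attributes):
    """Group rows that are indiscernible, i.e. equal on every listed attribute."""
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[name] for name in attributes)
        classes[key].append(row)
    return dict(classes)


successful = successful_requests(records)
classes = equivalence_classes(successful, ["ip", "agent"])
for key, members in classes.items():
    print(key, [m["url"] for m in members])
```

Choosing `ip` and `agent` as the attribute set is only one possible choice; a different set of attributes yields a different partition of the same log.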
In this study, users are clustered by their Entry Pages to a Website in order to analyze the deep-linked traffic arriving at the Website. The Top Ten Entry Pages, their traffic and their temporal information are also studied. |