Design of Index Structures for Supporting Personalized Information Filtering on the Internet

Autor: Tsu-I Chen, 陳姿伊
Rok vydání: 2003
Druh dokumentu: 學位論文 ; thesis
Popis: 91
Owing to the booming development of the WWW, it creates many new challenges for information filtering. Information Filtering (IF) is an area of research that develops tools for discriminating between relevant and irrelevant information. IF can find good matches between the web pages and the users'' information needs. Users first give descriptions about what they need, i.e., user profiles, to start the services. A profile index is built on these profiles. A series of incoming web pages will be put into the matching process. Each incoming web page is represented in the same form of the user profile. In this way, the users who are interested in an incoming web page can be identified by comparing the descriptions of the web page with each user profile. At last, the web page will be recommended to the users whose profiles belong to the filtered results. Therefore, a critical issue of the information filtering service is how to index the user profiles for an efficient matching process. When we index the user profile, we can reduce the costs of storage space and the processing time for modifying the user profiles. In this thesis, first, we propose a count-based tree method, which takes the count of each keyword into consideration, to reduce the large storage spaces as needed by the tree method. Next, three large-itemset-based methods are proposed to reduce the storage space, which are called the count-major large itemset method, the weighted large itemset method and the hybrid method. In these three large-itemset-based methods, we first cluster profiles with similar interests into the same group. Next, for each cluster, we apply the mining association rules techniques to help us to construct the index strategies. We design three methods by using the idea of the Apriori algorithm which is one of well-known approaches in mining association rules. But, we modify the minimum support and the goal in the Apriori algorithm. We may not always output the large itemset Lk. That is, we may only use Lw, where w < k. In summary, the cost of storage spaces of our four methods are less than that of the tree method proposed by Yan and Garcia-Molina. According to our simulation results, each of our four methods may provide the best result when different input data sets. Next, we propose a large-itemset-based approach to the incremental update of the index structure for storing keywords to reduce the update cost. When someone''s interests are often changed, we must care about the way how to provide the low update cost of the index structure. We take the weight of each keyword into consideration. That is, each keyword can be distinguished the long-term interest which has weight above the threshold from the short-term interest which has weight below the threshold. Owing to that the probability of modifying the short-term interests is higher than that of modifying the long-term interests, we can update the short-term interests locally. According to our simulation results, our method really can reduce the update cost as needed by Wu and Chen'' methods.
Databáze: Networked Digital Library of Theses & Dissertations