Blog Post Extraction and Irrelevant Blog Filtering for Opinion Search Engine

Author: Ping-hua Yang, 楊萍華
Year of publication: 2009
Document type: Thesis
Description: 97
The blogosphere is a social network made up of blogs, and blogs, which are among the most popular of the top websites, have been growing in number year by year. Blog pages cover a wide variety of topics, and their posted content includes both objective and subjective opinions. In the past, users who wanted to learn about a specific issue could turn to TV, magazines, or search engines, but these approaches are time-consuming and usually yield limited information. For these reasons, this work presents an opinion search engine for the blogosphere that combines blogs with a search engine and focuses on specific topics to show public opinion. The blog opinion search engine returns opinions in two ways: an online system that responds quickly using pages from a few fixed domains, and a background system that periodically updates opinions from a large number of blog pages in any domain, so that users can see newer information. Because it is impractical to retrieve posted content by manually adding extraction patterns for each blog website, we use machine learning to extract the posted content. Non-blog pages among the input pages reduce extraction performance, so we construct a blog/non-blog classifier with an F-measure of 90.7%, which filters non-blog pages efficiently and raises extraction performance by more than 10% in F-measure. Furthermore, because the positive and negative blocks in a blog page are unbalanced (the imbalanced data problem), we adopt a different approach to handle them. For filtering irrelevant pages, we add expansion words to the original method, which yields an improvement of about 61% in F-measure.
Database: Networked Digital Library of Theses & Dissertations