Popis: |
The continuing growth of information overflow has made it hard to obtain valuable information on the web. In this trend, the need for effective Information Retrieval (IR) technique has been increased. Although document data contain much more abundant information, users can retrieve necessary information only from the title and description in conventional web services. In order to meet the demands for fast and accurate retrieval of valuable information, we propose a fast and effective content-based document information retrieval system that retrieves the information from the actual content of a document. The proposed method is based on a topic model of Latent Dirichlet Allocation that is used to extract major keywords for a given document. The main contributions of our system are the increased flexibility, effectiveness, and fast retrieval of information. Our system can easily communicate with existing web service through the standard JSON format. In addition, we increase the speed of information retrieval by using NoSQL based database system with inverted indexing and B-tree based indexing. We validate the performance of our system on real data collected from the SlideShare service. The proposed system shows better retrieval performance over the existing IR system. |