Popis: |
Traditional clustering methods, where documents are represented by term frequency vectors, are not very suitable for Lithuanian document clustering as there is no any freely available morphological analyzer or stemmer to make compact term dictionaries. It is still possible though to cluster Lithuanian documents using loose term dictionaries, but as Lithuanian is a highly synthetic language significant increase in resources and possibly inaccurate or distorted results must be taken into account. In this master thesis a clustering method for personally classified documents is developed to overcome shortcomings of traditional document clustering stated above. In a new method documents are represented by tag frequency vectors, pair-wise similarities are measured by cosine coefficient and clustering itself is performed using experimentally selected bisecting K‑means algorithm. Experiments comparing developed method with traditional document clustering using loose term dictionary showed that former copes better with large document collections and/or large cluster number. At the same time subjective clustering estimation showed that even when new method demonstrates larger entropy and lower purity values, it still overcomes traditional method by clustering sense. |