Dimensionality Reduction in Webpage Categorization Using Probabilistic Latent Semantic Analysis and Adaptive General Particle Swarm Optimization
Authors: | Chunzhi Wang, Yala Tong |
---|---|
Year of publication: | 2009 |
Subject: | Optimization problem; Probabilistic latent semantic analysis; Dimensionality reduction; Feature vector; Ant colony optimization algorithms; Evolutionary algorithm; Particle swarm optimization; Pattern recognition; Data mining; Artificial intelligence; Multi-swarm optimization; Mathematics |
Source: | 2009 International Workshop on Intelligent Systems and Applications. |
Description: | A new method of text dimensionality reduction is proposed, based on probabilistic latent semantic analysis (PLSA) and adaptive general particle swarm optimization (AGPSO). PLSA is used to capture the essential associative semantic relationships in place of the original document space, and the dimension can be reduced greatly by the Expectation Maximization algorithm. A crossover operator is designed to simulate the alteration of flying velocity, and a mutation operator is used to maintain population diversity. In addition, an adaptive strategy is introduced to adjust the probabilities of crossover and mutation in order to obtain an optimal feature set. The experimental results indicate that the algorithm can not only reduce dimension but also improve categorization precision. In essence, reducing text feature vectors is an optimization procedure that searches for a minimal feature subset representing the problem space. It can be reduced to a 0-1 combinatorial optimization problem. In light of the success of evolutionary algorithms applied to combinatorial optimization problems similar to the TSP, many researchers have used evolutionary algorithms, including the Genetic Algorithm(5), Ant Colony Algorithm(6) and Particle Swarm Optimization(7), for dimensionality reduction. However, these techniques are all based on the vector space model (VSM), which determines the document category by calculating the distance between feature vectors. Such a representation, called the "bag of words" approach, has the following drawbacks. First, a large number of features are required for document representation. Second, it totally neglects semantic similarities between words, which could be important from a classification viewpoint.
The main contribution of the paper is a method using probabilistic latent semantic analysis (PLSA) and adaptive general particle swarm optimization (AGPSO): PLSA materializes latent semantics in the vector space, the Expectation Maximization (EM) algorithm spans an intermediate feature set, and AGPSO is then employed for further reduction on this basis. The experimental results show that the algorithm can not only reduce dimension but also improve categorization precision. |
Database: | OpenAIRE |
External link: |
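
The description says that PLSA, fitted with the Expectation Maximization algorithm, replaces the original document space with a much smaller latent semantic one. The sketch below illustrates that step with the textbook PLSA updates in NumPy. It is not the authors' implementation, and it does not cover the AGPSO stage. The function name `plsa`, its parameters and the toy count matrix are assumptions made for illustration only.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit a PLSA model by EM (illustrative sketch, not the paper's code).

    counts: (n_docs, n_words) array of term counts n(d, w).
    Returns P(z|d) with shape (n_docs, n_topics) and
            P(w|z) with shape (n_topics, n_words).
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    n_docs, n_words = counts.shape
    eps = 1e-12  # guards against division by zero

    # Random initialisation; every row is normalised to a distribution.
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]  # (docs, topics, words)
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)

        # M-step: re-estimate both distributions from expected counts.
        expected = counts[:, None, :] * post           # n(d,w) * P(z|d,w)
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps

    return p_z_d, p_w_z

# Toy corpus (hypothetical): 4 documents over a 4-word vocabulary.
counts = np.array([[4, 3, 0, 0],
                   [5, 2, 0, 0],
                   [0, 0, 3, 4],
                   [0, 0, 2, 5]])
doc_topics, topic_words = plsa(counts, n_topics=2)
print(doc_topics.shape)  # (4, 2)
```

Each row of `doc_topics` is a document expressed over `n_topics` latent factors instead of the full vocabulary, which is the kind of intermediate feature set the description refers to. A further feature-subset search, such as the AGPSO stage, would operate on this reduced representation.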