Popis: |
The abundance of high dimensional data has necessitated various approaches to dimensionality reduction. Many of these methods make simplifying assumptions about the variation structure of the data and permit approximating high-dimensional structures using only a few components. In the context of modern day machine learning applications, such assumptions can rarely be justified. To this end, Tenenbaum, de Silva, and Langford [2000] proposed an approach for approximating the underlying non-linear structure on which the variation occurs and such assumptions would be more tenable. This is achieved by first finding the non-linear underlying structure (manifold) in a high dimensional input space and then using the shortest distance matrix of the data points along this manifold to achieve non-linear dimensionality reduction. The authors proposed a 3 step approach : Step 1 - constructing a distance graph by defining the neighbours of each data point such that it is contained within one section of the manifold, Step 2 - approximating the geodesic distances between all points using the graph constructed in Step 1 and Step 3 - applying Classical MDS to this distance graph to achieve dimensionality reduction. The first objective of this research paper is to demonstrate that the approach by Tenenbaum et al. [2000] struggles to detect the manifold under conditions where the stochastic components of the data dominates to the extent that the underlying manifold may be difficult to approximate reliably. We propose and explore possible solutions on how to effectively deal with the variability in the data, such that the distance graph reliably reflect the geodesic interpoint distances. In principle, this deals with departures from the assumptions in Step 1 (constructing distance graph) and the remediation thereof. The second objective, assuming that one can reliably detect the manifold, relates to the often substantial computational overhead of the algorithm: Step 2 of the algorithm attempts to approximate geodesic distances on the manifold by traversing paths on the graph constructed in Step 1, more specifically, the shortest paths. We test and demonstrate an alternative strategy for approximating distances by relaxing the constraint that the shortest path needs to be found, substituting the condition that any (reasonable) path between points on the graph suffices for purposes of this algorithm. Lastly we demonstrate how the proposed remediations/variations on the algorithm can be used to conduct dimensionality reduction on real-world data sets, and provide some conclusions and possible future reseach topics. |