On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning

Petros Drineas; Michael W. Mahoney

Petros Drineas, Michael W. Mahoney; 6(72):2153−2175, 2005.

Abstract

A problem for many kernel-based methods is that the amount of computation required to find the solution scales as O(n³), where n is the number of training examples. We develop and analyze an algorithm to compute an easily-interpretable low-rank approximation to an n × n Gram matrix G such that computations of interest may be performed more rapidly. The approximation is of the form \tilde{G}_k = C W_k^{+} C^{T}, where C is a matrix consisting of a small number c of columns of G and W_k is the best rank-k approximation to W, the matrix formed by the intersection between those c columns of G and the corresponding c rows of G. An important aspect of the algorithm is the probability distribution used to randomly sample the columns; we will use a judiciously-chosen and data-dependent nonuniform probability distribution. Let ||·||₂ and ||·||_F denote the spectral norm and the Frobenius norm, respectively, of a matrix, and let G_k be the best rank-k approximation to G. We prove that by choosing O(k/ε⁴) columns

\| G - C W_k^{+} C^{T} \|_{\xi} \le \| G - G_k \|_{\xi} + \epsilon \sum_{i=1}^{n} G_{ii}^{2} ,

both in expectation and with high probability, for both ξ = 2, F, and for all k: 0 ≤ k ≤ rank(W). This approximation can be computed using O(n) additional space and time, after making two passes over the data from external storage. The relationships between this algorithm, other related matrix decompositions, and the Nyström method from integral equation theory are discussed.
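
The construction described above can be illustrated with a minimal NumPy sketch. It is not the authors' reference implementation. The abstract does not state the sampling distribution, so the sketch assumes columns are drawn with replacement with probability proportional to G_ii² (suggested by the error term in the bound) and rescaled by 1/sqrt(c·p_i). The function name `nystrom_approx` and its parameters are hypothetical, and the sketch materializes the full n × n approximation for clarity rather than working in O(n) additional space.

```python
import numpy as np


def nystrom_approx(G, c, k, rng=None):
    """Sketch of a rank-k Nystrom-style approximation C W_k^+ C^T of a Gram matrix G.

    Assumptions (not stated in the abstract): columns are sampled i.i.d. with
    replacement, with probabilities p_i proportional to G_ii^2, and each
    sampled column is rescaled by 1 / sqrt(c * p_i).
    """
    rng = np.random.default_rng(rng)
    n = G.shape[0]

    # Data-dependent, nonuniform sampling probabilities (assumed form).
    d = np.diag(G).astype(float)
    p = d**2 / np.sum(d**2)

    # Draw c column indices with replacement and compute the rescaling factors.
    idx = rng.choice(n, size=c, replace=True, p=p)
    scale = 1.0 / np.sqrt(c * p[idx])

    # C: the sampled, rescaled columns of G (n x c).
    C = G[:, idx] * scale
    # W: intersection of the sampled columns and the matching rows,
    # rescaled on both sides (c x c).
    W = C[idx, :] * scale[:, None]

    # Best rank-k approximation of the symmetric matrix W via its
    # eigendecomposition, then its pseudoinverse W_k^+.
    vals, vecs = np.linalg.eigh((W + W.T) / 2)
    order = np.argsort(vals)[::-1][:k]          # indices of the k largest eigenvalues
    tol = max(vals.max(), 0.0) * c * np.finfo(float).eps
    keep = order[vals[order] > tol]             # drop numerically zero eigenvalues
    Wk_pinv = (vecs[:, keep] / vals[keep]) @ vecs[:, keep].T

    return C @ Wk_pinv @ C.T


if __name__ == "__main__":
    # Usage: a Gram matrix of exact rank 3 built from random feature vectors.
    demo_rng = np.random.default_rng(0)
    X = demo_rng.standard_normal((50, 3))
    G = X @ X.T
    G_tilde = nystrom_approx(G, c=20, k=3, rng=1)
    print(np.linalg.norm(G - G_tilde) / np.linalg.norm(G))
```

Only the c sampled columns of G and its diagonal are read, which is in keeping with the two-pass access pattern mentioned above. A space-efficient variant would return the factors C and W_k⁺ instead of their product.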