## On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning

**Petros Drineas, Michael W. Mahoney**; 6(72):2153−2175, 2005.

### Abstract

A problem for many kernel-based methods is that the amount of computation required to find the solution scales as $O(n^3)$, where $n$ is the number of training examples. We develop and analyze an algorithm to compute an easily-interpretable low-rank approximation to an $n \times n$ Gram matrix $G$ such that computations of interest may be performed more rapidly. The approximation is of the form $\tilde{G}_k = C W_k^{+} C^T$, where $C$ is a matrix consisting of a small number $c$ of columns of $G$ and $W_k$ is the best rank-$k$ approximation to $W$, the matrix formed by the intersection between those $c$ columns of $G$ and the corresponding $c$ rows of $G$. An important aspect of the algorithm is the probability distribution used to randomly sample the columns; we will use a judiciously-chosen and data-dependent nonuniform probability distribution. Let $\|\cdot\|_2$ and $\|\cdot\|_F$ denote the spectral norm and the Frobenius norm, respectively, of a matrix, and let $G_k$ be the best rank-$k$ approximation to $G$. We prove that by choosing $O(k/\epsilon^4)$ columns,

$$\|G - C W_k^{+} C^T\|_\xi \le \|G - G_k\|_\xi + \epsilon \sum_{i=1}^{n} G_{ii}^2,$$

both in expectation and with high probability, for both $\xi = 2, F$, and for all $k: 0 \le k \le \mathrm{rank}(W)$. This approximation can be computed using $O(n)$ additional space and time, after making two passes over the data from external storage. The relationships between this algorithm, other related matrix decompositions, and the Nyström method from integral equation theory are discussed.
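
The construction $\tilde{G}_k = C W_k^{+} C^T$ described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not specify the sampling distribution or any rescaling, so two choices here are assumptions: columns are drawn with probability $p_i \propto G_{ii}^2$ (suggested by the error term in the bound), and the sampled columns are rescaled by $1/\sqrt{c\,p_i}$. The function name `nystrom_approximation` and its parameters are hypothetical.

```python
import numpy as np


def nystrom_approximation(G, c, k, rng=None):
    """Sketch of a rank-k approximation C @ pinv(W_k) @ C.T of a Gram matrix G.

    G is assumed symmetric positive semidefinite with a nonzero diagonal.
    """
    rng = np.random.default_rng(rng)
    n = G.shape[0]

    # ASSUMPTION: sampling probabilities proportional to squared diagonal entries.
    diag_sq = np.diag(G) ** 2
    p = diag_sq / diag_sq.sum()

    # Draw c column indices independently, with replacement.
    idx = rng.choice(n, size=c, replace=True, p=p)

    # ASSUMPTION: rescale each sampled column by 1 / sqrt(c * p_i).
    scale = 1.0 / np.sqrt(c * p[idx])
    C = G[:, idx] * scale                             # n x c sampled columns
    W = G[np.ix_(idx, idx)] * np.outer(scale, scale)  # c x c intersection block

    # Best rank-k approximation of the symmetric PSD matrix W:
    # keep the k largest eigenvalues and their eigenvectors.
    eigvals, eigvecs = np.linalg.eigh(W)
    order = np.argsort(eigvals)[::-1][:k]
    top_vals, top_vecs = eigvals[order], eigvecs[:, order]

    # Pseudoinverse of W_k: invert only eigenvalues above a small tolerance.
    tol = np.finfo(float).eps * c * max(top_vals.max(), 0.0)
    keep = top_vals > tol
    inv_vals = np.zeros_like(top_vals)
    inv_vals[keep] = 1.0 / top_vals[keep]
    W_k_pinv = (top_vecs * inv_vals) @ top_vecs.T

    return C @ W_k_pinv @ C.T
```

Only the $c$ sampled columns of $G$ (plus its diagonal, for the assumed probabilities) are needed to form $C$ and $W$; the sketch takes the full matrix as input purely for convenience. The final product is returned as a dense $n \times n$ matrix for easy checking, whereas in practice one would usually keep the factors $C$ and $W_k^{+}$ separate.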

© JMLR 2005.