## Active Clustering of Biological Sequences

** Konstantin Voevodski, Maria-Florina Balcan, Heiko Röglin, Shang-Hua Teng, Yu Xia**; 13(7):203−225, 2012.

### Abstract

Given a point set *S* and an unknown metric *d* on *S*, we study the problem of efficiently partitioning *S* into *k* clusters while querying few distances between the points. In our model we assume that we have access to *one versus all* queries that given a point *s ∈ S* return the distances between *s* and all other points. We show that given a natural assumption about the structure of the instance, we can efficiently find an accurate clustering using only *O(k)* distance queries. Our algorithm uses an *active* selection strategy to choose a small set of points that we call landmarks, and considers only the distances between landmarks and other points to produce a clustering. We use our procedure to cluster proteins by sequence similarity. This setting nicely fits our model because we can use a fast sequence database search program to query a sequence against an entire data set. We conduct an empirical study that shows that even though we query a small fraction of the distances between the points, we produce clusterings that are close to a desired clustering given by manual classification.

© JMLR 2012. (edit, beta) |