noise-looking variables. We cast this problem as finding a low-dimensional projection of the data which is well-clustered. This yields a one-dimensional projection in the simplest situation with two clusters, and extends naturally to a multi-label scenario for more than two clusters. In this paper, (a) we first show that this joint clustering and dimension reduction formulation is equivalent to previously proposed discriminative clustering frameworks, thus leading to convex relaxations of the problem; (b) we propose a novel sparse extension, which is still cast as a convex relaxation and allows estimation in higher dimensions; (c) we propose a natural extension for the multi-label scenario; (d) we provide a new theoretical analysis of the performance of these formulations with a simple probabilistic model, leading to scalings of the form $d=O(\sqrt{n})$ for the affine invariant case and $d=O(n)$ for the sparse case, where $n$ is the number of examples and $d$ the ambient dimension; and finally, (e) we propose an efficient iterative algorithm with running-time complexity proportional to $O(nd^2)$, improving on earlier algorithms for discriminative clustering with the square loss, which had quadratic complexity in the number of examples.

In this paper we introduce a new optimization formulation for
sparse regression and compressed sensing, called CLOT (Combined
L-One and Two), wherein the regularizer is a convex combination
of the $\ell_1$- and $\ell_2$-norms. This formulation differs
from the Elastic Net (EN) formulation, in which the regularizer
is a convex combination of the $\ell_1$- and $\ell_2$-norm
*squared*. It is shown that, in the context of compressed
sensing, the EN formulation *does not achieve* robust
recovery of sparse vectors, whereas the new CLOT formulation
achieves robust recovery. Also, like EN but unlike LASSO, the
CLOT formulation achieves the grouping effect, wherein
coefficients of highly correlated columns of the measurement (or
design) matrix are assigned roughly comparable values. It is
already known that LASSO does not have the grouping effect. Therefore
the CLOT formulation combines the best features of both LASSO
(robust sparse recovery) and EN (grouping effect).

The CLOT formulation is a special case of another one called SGL (Sparse Group LASSO) which was introduced into the literature previously, but without any analysis of either the grouping effect or robust sparse recovery. It is shown here that SGL achieves robust sparse recovery, and also achieves a version of the grouping effect in that coefficients of highly correlated columns belonging to the same group of the measurement (or design) matrix are assigned roughly comparable values.
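
To make the difference between the two regularizers concrete, here is a minimal sketch in Python. The weighting parameter `mu` and the function names are assumptions made for illustration; the text only states that each regularizer is a convex combination, with CLOT using the plain $\ell_2$-norm and EN using the $\ell_2$-norm squared.

```python
import math

def l1_norm(z):
    """Sum of absolute values."""
    return sum(abs(v) for v in z)

def l2_norm(z):
    """Euclidean norm (not squared)."""
    return math.sqrt(sum(v * v for v in z))

def clot_regularizer(z, mu):
    """(1 - mu) * ||z||_1 + mu * ||z||_2, with mu in [0, 1] (assumed weighting)."""
    return (1.0 - mu) * l1_norm(z) + mu * l2_norm(z)

def en_regularizer(z, mu):
    """(1 - mu) * ||z||_1 + mu * ||z||_2^2, the Elastic Net style penalty."""
    return (1.0 - mu) * l1_norm(z) + mu * l2_norm(z) ** 2

z = [3.0, -4.0, 0.0]
print(clot_regularizer(z, 0.5))  # 6.0
print(en_regularizer(z, 0.5))    # 16.0
```

One visible consequence of dropping the square: doubling `z` exactly doubles the CLOT value, whereas the EN value grows faster than linearly.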

We consider a firm that sells a large number of products to its customers in an online fashion. Each product is described by a high dimensional feature vector, and the market value of a product is assumed to be linear in the values of its features. Parameters of the valuation model are unknown and can change over time. The firm sequentially observes a product's features and can use the historical sales data (binary sale/no sale feedback) to set the price of the current product, with the objective of maximizing the collected revenue. We measure the performance of a dynamic pricing policy via regret, which is the expected revenue loss compared to a clairvoyant that knows the sequence of model parameters in advance.

We propose a pricing policy based on projected stochastic gradient descent (PSGD) and characterize its regret in terms of time $T$, feature dimension $d$, and the temporal variability in the model parameters, $\delta_t$. We consider two settings. In the first one, feature vectors are chosen antagonistically by nature and we prove that the regret of the PSGD pricing policy is of order $O(\sqrt{T} + \sum_{t=1}^T \sqrt{t}\delta_t)$. In the second setting (referred to as the stochastic features model), the feature vectors are drawn independently from an unknown distribution. We show that in this case, the regret of the PSGD pricing policy is of order $O(d^2 \log T + \sum_{t=1}^T t\delta_t/d)$.

spiked-smooth classifier, and we view AdaBoost in the same light. We conjecture that both AdaBoost and random forests succeed because of this mechanism. We provide a number of examples to support this explanation. In the process, we question the conventional wisdom that suggests that boosting algorithms for classification require regularization or early stopping and should be limited to low complexity classes of learners, such as decision stumps. We conclude that boosting should be used like random forests: with large decision trees, without regularization or early stopping.

Kernel regression or classification (also referred to as
*weighted* $\epsilon$-NN methods in Machine Learning) are
appealing for their simplicity and therefore ubiquitous in data
analysis. However, practical implementations of kernel
regression or classification consist of quantizing or sub-sampling
data for improving time efficiency, often at the cost
of prediction quality. While such tradeoffs are necessary in
practice, their statistical implications are generally not well
understood, hence practical implementations come with few
performance guarantees. In particular, it is unclear whether it
is possible to maintain the statistical accuracy of kernel
prediction---crucial in some applications---while improving
prediction time.

The present work provides guiding principles for combining kernel prediction with data-quantization so as to guarantee good tradeoffs between prediction time and accuracy, and in particular so as to approximately maintain the good accuracy of vanilla kernel prediction.

Furthermore, our tradeoff guarantees are
worked out explicitly in terms of a tuning parameter which acts
as a *knob* that favors either time or accuracy depending
on practical needs. On one end of the knob, prediction time is
of the same order as that of *single*-nearest-neighbor
prediction (which is statistically inconsistent) while
maintaining consistency; on the other end of the knob, the
prediction risk is nearly minimax-optimal (in terms of the
original data size) while still reducing time complexity. The
analysis thus reveals the interaction between the data-quantization
approach and the kernel prediction method, and most
importantly gives explicit control of the tradeoff to the
practitioner rather than fixing the tradeoff in advance or
leaving it opaque. The theoretical results are validated on data
from a range of real-world application domains; in particular we
demonstrate that the theoretical *knob* performs as
expected.
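
The idea of pairing kernel prediction with data-quantization can be illustrated with a small sketch. This is not the authors' procedure: the 1-D grid quantizer, the box kernel, and the names `quantize`, `predict`, `r` (cell width, standing in for the tuning knob) and `h` (bandwidth) are all assumptions chosen for brevity. Training points are collapsed into grid cells that keep only a count and a label sum, and prediction averages over cells whose centers fall within the bandwidth.

```python
import math
from collections import defaultdict

def quantize(xs, ys, r):
    """Collapse 1-D training data into grid cells of width r.

    Each cell stores (number of points, sum of their labels)."""
    cells = defaultdict(lambda: [0, 0.0])
    for x, y in zip(xs, ys):
        cell = cells[math.floor(x / r)]
        cell[0] += 1
        cell[1] += y
    return {k: (count, sum_y) for k, (count, sum_y) in cells.items()}

def predict(cells, r, q, h):
    """Box-kernel estimate at query q using cell centers weighted by counts."""
    total_count, total_y = 0, 0.0
    for k, (count, sum_y) in cells.items():
        center = (k + 0.5) * r
        if abs(center - q) <= h:
            total_count += count
            total_y += sum_y
    if total_count == 0:
        return None  # no cell center within the bandwidth
    return total_y / total_count

cells = quantize([0.1, 0.2, 0.9, 1.1], [1.0, 3.0, 10.0, 14.0], r=0.5)
print(predict(cells, r=0.5, q=0.2, h=0.3))  # 2.0
```

A larger `r` means fewer cells to scan at prediction time but a coarser approximation of the original data, which is the kind of time versus accuracy tradeoff the abstract describes.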

Information diffusion in online social networks is affected
by the underlying network topology, but it also has the power to
change it. Online users are constantly creating new links when
exposed to new information sources, and in turn these links are
altering the way information spreads. However, these two
highly intertwined stochastic processes, information diffusion
and network evolution, have been predominantly studied
*separately*, ignoring their co-evolutionary
dynamics.

We propose a temporal point process model, Coevolve, for such joint dynamics, allowing the intensity of one process to be modulated by that of the other. This model allows us to efficiently simulate interleaved diffusion and network events, and generate traces obeying common diffusion and network patterns observed in real-world networks such as Twitter. Furthermore, we also develop a convex optimization framework to learn the parameters of the model from historical diffusion and network evolution traces. We experimented with both synthetic data and data gathered from Twitter, and show that our model provides a good fit to the data as well as more accurate predictions than alternatives.

preference-based teaching and a corresponding complexity parameter---the preference-based teaching dimension (PBTD)---representing the worst-case number of examples needed to teach any concept in a given concept class. Although the PBTD coincides with the well-known recursive teaching dimension (RTD) on finite classes, it is radically different on infinite ones: the RTD becomes infinite already for trivial infinite classes (such as half-intervals) whereas the PBTD evaluates to reasonably small values for a wide collection of infinite classes including classes consisting of so-called closed sets w.r.t. a given closure operator, including various classes related to linear sets over $\mathbb{N}_0$ (whose RTD had been studied quite recently) and including the class of Euclidean half-spaces. On top of presenting these concrete results, we provide the reader with a theoretical framework (of a combinatorial flavor) which helps to derive bounds on the PBTD.

We present a canonical way to turn any smooth parametric
family of probability distributions on an arbitrary search space
$X$ into a continuous-time black-box optimization method on $X$,
the *information-geometric optimization* (IGO) method.
Invariance as a major design principle keeps the number of
arbitrary choices to a minimum. The resulting *IGO flow*
is the flow of an ordinary differential equation conducting the
natural gradient ascent of an adaptive, time-dependent
transformation of the objective function. It makes no particular
assumptions on the objective function to be optimized.

The IGO method produces explicit IGO algorithms through time discretization. It naturally recovers versions of known algorithms and offers a systematic way to derive new ones. In continuous search spaces, IGO algorithms take a form related to natural evolution strategies (NES). The cross-entropy method is recovered in a particular case with a large time step, and can be extended into a smoothed, parametrization-independent maximum likelihood update (IGO-ML). When applied to the family of Gaussian distributions on $\mathbb{R}^d$, the IGO framework recovers a version of the well-known CMA-ES algorithm and of xNES. For the family of Bernoulli distributions on $\{0,1\}^d$, we recover the seminal PBIL algorithm and cGA. For the distributions of restricted Boltzmann machines, we naturally obtain a novel algorithm for discrete optimization on $\{0,1\}^d$. All these algorithms are natural instances of, and unified under, the single information-geometric optimization framework.
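
As an illustration of the Bernoulli case mentioned above, the following is a minimal sketch of a classic PBIL-style update on $\{0,1\}^d$. It is written from the textbook form of PBIL rather than derived from the IGO flow; the OneMax objective, the best-of-population selection, the clipping bounds, and all parameter names are assumptions made for the example.

```python
import random

def pbil_onemax(d=10, samples=20, lr=0.1, steps=200, seed=0):
    """PBIL-style search maximizing the number of ones in a bit string.

    p[i] is the Bernoulli parameter of bit i. Each step samples a population,
    picks the best individual, and moves p toward it by a factor lr."""
    rng = random.Random(seed)
    p = [0.5] * d
    for _ in range(steps):
        population = [
            [1 if rng.random() < pi else 0 for pi in p]
            for _ in range(samples)
        ]
        best = max(population, key=sum)  # OneMax fitness is the bit count
        p = [(1.0 - lr) * pi + lr * bi for pi, bi in zip(p, best)]
        # Clipping (an assumption here) keeps every bit from freezing at 0 or 1.
        p = [min(0.95, max(0.05, pi)) for pi in p]
    return p

print(pbil_onemax())
```

The search state is the vector of Bernoulli parameters itself, which is what makes this family a natural fit for a framework that updates a distribution rather than a population.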

The IGO method achieves, thanks to its intrinsic formulation, maximal invariance properties: invariance under reparametrization of the search space $X$, under a change of parameters of the probability distribution, and under increasing transformation of the function to be optimized. The latter is achieved through an adaptive, quantile-based formulation of the objective. Theoretical considerations strongly suggest that IGO algorithms are essentially characterized by a minimal change of the distribution over time. Therefore they have minimal loss in diversity through the course of optimization, provided the initial diversity is high. First experiments using restricted Boltzmann machines confirm this insight. As a simple consequence, IGO seems to provide, from information theory, an elegant way to simultaneously explore several valleys of a fitness landscape in a single run.

`imbalanced-learn` is an open-source Python toolbox
aiming at providing a wide range of methods to cope with the
problem of imbalanced datasets frequently encountered in machine
learning and pattern recognition. The implemented state-of-the-art
methods can be categorized into 4 groups: (i) under-sampling,
(ii) over-sampling, (iii) combination of over- and
under-sampling, and (iv) ensemble learning methods. The proposed
toolbox depends only on `numpy`, `scipy`, and `scikit-learn`
and is distributed under the MIT license. Furthermore, it is fully
compatible with `scikit-learn` and is part of the
`scikit-learn-contrib` supported project. Documentation, unit tests as
well as integration tests are provided to ease usage and
contribution. Source code, binaries, and documentation can be
downloaded from
github.com/scikit-learn-contrib/imbalanced-learn.
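
To give a feel for the over-sampling category, here is a small standard-library sketch of random over-sampling. It deliberately does not use or imitate the `imbalanced-learn` API; the function name `random_over_sample` and its signature are hypothetical, and the toolbox's own samplers should be preferred in practice.

```python
import random
from collections import Counter

def random_over_sample(X, y, seed=0):
    """Duplicate randomly chosen minority samples until all classes
    have as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out

X = [[0], [1], [2], [3], [10]]
y = [0, 0, 0, 0, 1]
X_res, y_res = random_over_sample(X, y)
print(Counter(y_res))  # both classes now have 4 samples
```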

focussed. Parametric families of patterns can be learnt from a very small number of examples. Stored temporal patterns can be content-addressed in ways that are analogous to recalling static patterns in Hopfield networks.

debiased or desparsified lasso estimators. We show the approach converges at the same rate as the lasso as long as the dataset is not split across too many machines, and consistently estimates the support under weaker conditions than the lasso. On the computational side, we propose a new parallel and computationally-efficient algorithm to compute the approximate inverse covariance required in the debiasing approach, when the dataset is split across samples. We further extend the approach to generalized linear models.

out-of-the-box functionality. Based on the Alternating Direction Method of Multipliers (ADMM), it is able to efficiently store, analyze, parallelize, and solve large optimization problems from a variety of different applications. Documentation, examples, and more can be found on the SnapVX website at snap.stanford.edu/snapvx.

global hierarchical algorithms like bisecting $k$-means, and prove that popular divisive hierarchical algorithms produce clusterings that cannot be produced by any linkage-based algorithm.

merge step that reduces sensitivity to hyperparameter settings. Survey data are typically acquired under an informative sampling design where the probability of inclusion depends on the surveyed response such that the distribution for the observed sample is different from the population. We extend the derivation of a penalized objective function to use a pseudo-posterior that incorporates sampling weights that undo the informative design. We provide a simulation study to demonstrate that our approach produces unbiased estimation for the outlying cluster under informative sampling. The method is applied for outlier nomination for the Current Employment Statistics survey conducted by the Bureau of Labor Statistics.

proper composite loss, which is the composition of a proper loss and a link function. We extend existing results for binary losses to multiclass losses. We subsume results on “classification calibration” by relating it to properness. We determine the stationarity condition, Bregman representation, order-sensitivity, and quasi-convexity of multiclass proper losses. We then characterise the existence and uniqueness of the composite representation for multiclass losses. We show how the composite representation is related to other core properties of a loss: mixability, admissibility and (strong) convexity of multiclass losses which we characterise in terms of the Hessian of the Bayes risk. We show that the simple integral representation for binary proper losses cannot be extended to multiclass losses but offer concrete guidance regarding how to design different loss functions. The conclusion drawn from these results is that the proper composite representation is a natural and convenient tool for the design of multiclass loss functions.

sketching matrix, $S\in\mathbb{R}^{r \times n}$, where $r \ll n$. Then, rather than solving the LS problem using the full data $(X,Y)$, sketching algorithms solve the LS problem using only the sketched data $(SX, SY)$. Prior work has typically adopted an

smoothing of mutual information. We then introduce an efficiently computable consistent estimator of our population measure of dependence, and we empirically establish its equitability on a large class of noisy functional relationships. This new statistic has better bias/variance properties and better runtime complexity than a previous heuristic approach. Next, we derive a second, related statistic whose computation is a trivial side-product of our algorithm and whose goal is powerful independence testing rather than equitability. We prove that this statistic yields a consistent independence test and show in simulations that the test has good power against independence. Taken together, our results suggest that these two statistics are a valuable pair of tools for exploratory data analysis.

We study the problem of high-dimensional regression when there may be interacting variables. Approaches using sparsity-inducing penalty functions such as the Lasso can be useful for producing interpretable models. However, when the number of variables runs into the thousands, and so even two-way interactions number in the millions, these methods may become computationally infeasible. Typically variable screening based on model fits using only main effects must be performed first. One problem with screening is that important variables may be missed if they are only useful for prediction when certain interaction terms are also present in the model.

To tackle this issue, we introduce a new method we call Backtracking. It can be incorporated into many existing high-dimensional methods based on penalty functions, and works by building increasing sets of candidate interactions iteratively. Models fitted on the main effects and interactions selected early on in this process guide the selection of future interactions. By also making use of previous fits for computation, as well as performing calculations in parallel, the overall run-time of the algorithm can be greatly reduced.

The effectiveness of our method when applied to regression and classification problems is demonstrated on simulated and real data sets. In the case of using Backtracking with the Lasso, we also give some theoretical support for our procedure.

Many data sets can be represented as a sequence of interactions between entities---for example communications between individuals in a social network, protein-protein interactions or DNA-protein interactions in a biological context, or vehicles' journeys between cities. In these contexts, there is often interest in making predictions about future interactions, such as who will message whom.

A popular approach to network modeling in a Bayesian context is to assume that the observed interactions can be explained in terms of some latent structure. For example, traffic patterns might be explained by the size and importance of cities, and social network interactions might be explained by the social groups and interests of individuals. Unfortunately, while elucidating this structure can be useful, it often does not directly translate into an effective predictive tool. Further, many existing approaches are not appropriate for sparse networks, a class that includes many interesting real-world situations.

In this paper, we develop models for sparse networks that combine structure elucidation with predictive performance. We use a Bayesian nonparametric approach, which allows us to predict interactions with entities outside our training set, and allows both the latent dimensionality of the model and the number of nodes in the network to grow in expectation as we see more data. We demonstrate that we can capture latent structure while maintaining predictive power, and discuss possible extensions.

We investigate the online version of Principal Component
Analysis (PCA), where in each trial $t$ the learning algorithm
chooses a $k$-dimensional subspace, and upon receiving the next
instance vector $\mathbf{x}_t$, suffers the compression loss,
which is the squared Euclidean distance between
this instance and its projection into the chosen subspace. When
viewed in the right parameterization, this compression loss is
linear, i.e. it can be rewritten as
$\text{tr}(\mathbf{W}_t\mathbf{x}_t\mathbf{x}_t^\top)$, where $\mathbf{W}_t$ is
the parameter of the algorithm and the outer product
$\mathbf{x}_t\mathbf{x}_t^\top$ (with $\|\mathbf{x}_t\|\le 1$) is the instance matrix.
In this paper we generalize PCA to arbitrary positive definite
instance matrices $\mathbf{X}_t$ with the linear loss
$\text{tr}(\mathbf{W}_t\mathbf{X}_t)$.
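
The claim that the compression loss is linear in a suitable parameter can be checked numerically. The sketch below takes $k=1$ with a unit vector `u` spanning the subspace, and assumes the parameter is the complementary projection $\mathbf{W} = \mathbf{I} - \mathbf{u}\mathbf{u}^\top$; that choice of $\mathbf{W}$ is an assumption for illustration, since the text above does not spell out the parameterization.

```python
def compression_loss(u, x):
    """Squared Euclidean distance between x and its projection onto span(u).

    u is assumed to be a unit vector."""
    ux = sum(ui * xi for ui, xi in zip(u, x))
    residual = [xi - ux * ui for ui, xi in zip(u, x)]
    return sum(r * r for r in residual)

def linear_loss(u, x):
    """tr(W x x^T) with W = I - u u^T, computed entry by entry."""
    d = len(x)
    W = [
        [(1.0 if i == j else 0.0) - u[i] * u[j] for j in range(d)]
        for i in range(d)
    ]
    return sum(W[i][j] * x[j] * x[i] for i in range(d) for j in range(d))

u = [0.6, 0.8, 0.0]   # unit vector spanning the chosen subspace
x = [0.5, 0.0, 0.5]   # instance with norm at most one
print(compression_loss(u, x), linear_loss(u, x))  # both close to 0.41
```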

We evaluate online algorithms in terms of their worst-case regret, which is a bound on the additional total loss of the online algorithm on all instance matrices over the compression loss of the best $k$-dimensional subspace (chosen in hindsight). We focus on two popular online algorithms for generalized PCA: the Gradient Descent (GD) and Matrix Exponentiated Gradient (MEG) algorithms. We show that if the regret is expressed as a function of the number of trials, then both algorithms are optimal to within a constant factor on worst-case sequences of positive definite instance matrices with trace norm at most one (which subsumes the original PCA problem with outer products). This is surprising because MEG is believed to be suboptimal in this case. We also show that when considering regret bounds as a function of a loss budget, then MEG remains optimal and strictly outperforms GD when the instance matrices are trace norm bounded.

Next, we consider online PCA when the adversary is allowed to present the algorithm with positive semidefinite instance matrices whose largest eigenvalue is bounded (rather than their trace which is the sum of their eigenvalues). Again we can show that MEG is optimal and strictly better than GD in this setting.

The `mlr` package provides a generic, object-oriented,
and extensible framework for classification,
regression, survival analysis and clustering for the R language.
It provides a unified interface to more than 160 basic learners
and includes meta-algorithms and model selection techniques to
improve and extend the functionality of basic learners with,
e.g., hyperparameter tuning, feature selection, and ensemble
construction. Parallel high-performance computing is natively
supported. The package targets practitioners who want to quickly
apply machine learning algorithms, as well as researchers who
want to implement, benchmark, and compare their new methods in a
structured environment.

guaranteed denotes that the region defined surely encloses weight sets that are global minimizers of the neural network's error function. Although the solution set to the bounding problem for an MLP is in general non-convex, the paper presents the theoretical results that help derive a box which is a convex set. This box is an outer approximation of the algebraic solutions to the interval equations resulting from the function implemented by the network nodes. An experimental study using well known benchmarks is presented in accordance with the theoretical results.

`StabLe`, a
structure learning algorithm based on MDC. We use simulated
datasets for five benchmark network topologies to empirically
demonstrate how `StabLe` improves upon ordinary
least squares (OLS) regression. We also apply
`StabLe` to microarray gene expression data for
lymphoblastoid cells from 727 individuals belonging to eight
global population groups. We establish that `StabLe`
improves test set performance relative to OLS via ten-fold
cross-validation. Finally, we develop `SGEX`, a
method for quantifying differential expression of genes between
different population groups.

no-free-lunch requirement, this mechanism is the one and only incentive-compatible payment mechanism possible. We also show that among all possible incentive-compatible mechanisms (that may or may not satisfy no-free-lunch), our mechanism makes the smallest possible payment to spammers. We further extend our results to a more general setting in which workers are required to provide a quantized confidence for each question. Interestingly, this unique mechanism takes a multiplicative form. The simplicity of the mechanism is an added benefit. In preliminary experiments involving over 900 worker-task pairs, we observe a significant drop in the error rates under this unique mechanism for the same or lower monetary expenditure.

Slow feature analysis (SFA) is an unsupervised learning algorithm that extracts slowly varying features from a multi-dimensional time series. Graph-based SFA (GSFA) is an extension to SFA for supervised learning that can be used to successfully solve regression problems if combined with a simple supervised post-processing step on a small number of slow features. The objective function of GSFA minimizes the squared output differences between pairs of samples specified by the edges of a structure called a training graph. The edges of current training graphs, however, are derived only from the relative order of the labels. Exploiting the exact numerical value of the labels enables further improvements in label estimation accuracy.
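
The quantity that GSFA minimizes can be sketched as a weighted sum of squared output differences over the edges of the training graph. In the snippet below the edge-weight dictionary, the function name `weighted_delta`, and the normalization by total edge weight are assumptions for illustration; the text only says that squared differences between edge-connected samples are minimized.

```python
def weighted_delta(outputs, edges):
    """Weighted mean of squared output differences over graph edges.

    outputs: list of scalar feature values, one per sample.
    edges:   dict mapping (i, j) sample index pairs to non-negative weights."""
    total_weight = sum(edges.values())
    weighted = sum(
        w * (outputs[j] - outputs[i]) ** 2 for (i, j), w in edges.items()
    )
    return weighted / total_weight

outputs = [0.0, 1.0, 3.0]
edges = {(0, 1): 1.0, (1, 2): 2.0}
print(weighted_delta(outputs, edges))  # 3.0
```

Because only connected pairs contribute, the choice of edges and weights determines which samples are pushed toward similar outputs, which is why the construction of the training graph matters.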

In this article, we propose the exact label learning (ELL)
method to create a more precise training graph that encodes the
desired labels explicitly and allows GSFA to extract a
normalized version of them directly (i.e., without supervised
post-processing). The ELL method is used for three tasks: (1)
We estimate gender from artificial images of human faces
(regression) and show the advantage of coding additional labels,
particularly skin color. (2) We analyze two existing graphs for
regression. (3) We extract *compact* discriminative
features to classify traffic sign images. When the number of
output features is limited, such compact features provide a
higher classification rate compared to a graph that generates
features equivalent to those of nonlinear Fisher discriminant
analysis. The method is versatile, directly supports multiple
labels, and provides higher accuracy compared to current graphs
for the problems considered.

In sparse principal component analysis we are given noisy observations of a low-rank matrix of dimension $n\times p$ and seek to reconstruct it under additional sparsity assumptions. In particular, we assume here each of the principal components $v_1,\dots,v_r$ has at most $s_0$ non-zero entries. We are particularly interested in the high dimensional regime wherein $p$ is comparable to, or even much larger than $n$.

In an influential paper, Johnstone and Lu (2004) introduced a simple algorithm that estimates the support of the principal vectors $v_1,\dots,v_r$ by the largest entries in the diagonal of the empirical covariance. This method can be shown to identify the correct support with high probability if $s_0\le K_1\sqrt{n/\log p}$, and to fail with high probability if $s_0\ge K_2 \sqrt{n/\log p}$ for two constants $0 < K_1, K_2 < \infty$. Despite a considerable amount of work over the last ten years, no practical algorithm exists with provably better support recovery guarantees.
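
The diagonal-based support estimate described above is simple enough to sketch directly. The snippet assumes the rows of `X` are already centered, takes the empirical covariance to be $X^\top X / n$, and treats the target support size `s` as given; the function name and these conventions are illustrative assumptions, not details taken from the cited paper.

```python
def diagonal_thresholding_support(X, s):
    """Return the indices of the s largest diagonal entries of X^T X / n.

    X is a list of n rows, each a list of p (assumed centered) features."""
    n = len(X)
    p = len(X[0])
    diag = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    top = sorted(range(p), key=lambda j: diag[j], reverse=True)[:s]
    return sorted(top)

X = [
    [3.0, 0.1, -2.0, 0.2],
    [-3.0, -0.1, 2.0, 0.1],
    [3.0, 0.2, -2.0, -0.2],
    [-3.0, 0.1, 2.0, -0.1],
]
print(diagonal_thresholding_support(X, 2))  # [0, 2]
```

Only the variances of individual coordinates are used here, so correlations between coordinates play no role in the estimate.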

Here we analyze a covariance thresholding algorithm that was recently proposed by Krauthgamer, Nadler, Vilenchik, et al. (2015). On the basis of numerical simulations (for the rank-one case), these authors conjectured that covariance thresholding correctly recovers the support with high probability for $s_0\le K\sqrt{n}$ (assuming $n$ of the same order as $p$). We prove this conjecture, and in fact establish a more general guarantee including higher-rank as well as $n$ much smaller than $p$. Recent lower bounds (Berthet and Rigollet, 2013; Ma and Wigderson, 2015) suggest that no polynomial time algorithm can do significantly better.

The key technical component of our analysis develops new bounds on the norm of kernel random matrices, in regimes that were not considered before. Using these, we also derive sharp bounds for estimating the population covariance, and the principal component (with $\ell_2$-loss).

Optimization on manifolds is a class of methods for optimization of an objective function, subject to constraints which are smooth, in the sense that the set of points which satisfy the constraints admits the structure of a differentiable manifold. While many optimization problems are of the described form, technicalities of differential geometry and the laborious calculation of derivatives pose a significant barrier for experimenting with these methods.

We introduce Pymanopt (available at pymanopt.github.io), a toolbox for optimization on manifolds, implemented in Python, that---similarly to the Manopt Matlab toolbox---implements several manifold geometries and optimization algorithms. Moreover, we lower the barriers to users further by using automated differentiation for calculating derivative information, saving users time and saving them from potential calculation and implementation errors.
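
To show what optimization on a manifold involves, here is a hand-written sketch of Riemannian gradient descent on the unit sphere, minimizing $x^\top A x$. It does not use Pymanopt or automatic differentiation; the gradient is derived by hand, and the function name, step size, and iteration count are assumptions. The three steps (Euclidean gradient, projection onto the tangent space, retraction by renormalizing) are the manual work that a toolbox of this kind is meant to take over.

```python
import math

def rayleigh_descent_on_sphere(A, x0, step=0.1, iters=200):
    """Minimize x^T A x over the unit sphere by Riemannian gradient descent."""
    d = len(x0)
    norm = math.sqrt(sum(v * v for v in x0))
    x = [v / norm for v in x0]
    for _ in range(iters):
        Ax = [sum(A[i][j] * x[j] for j in range(d)) for i in range(d)]
        egrad = [2.0 * v for v in Ax]                   # Euclidean gradient (A symmetric)
        radial = sum(g * xi for g, xi in zip(egrad, x))
        rgrad = [g - radial * xi for g, xi in zip(egrad, x)]  # tangent projection
        x = [xi - step * g for xi, g in zip(x, rgrad)]  # step in the tangent direction
        norm = math.sqrt(sum(v * v for v in x))
        x = [v / norm for v in x]                       # retraction back to the sphere
    cost = sum(x[i] * A[i][j] * x[j] for i in range(d) for j in range(d))
    return x, cost

A = [[3.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
x, cost = rayleigh_descent_on_sphere(A, [1.0, 1.0, 1.0])
print(cost)  # close to 1.0, the smallest eigenvalue of A
```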

This article describes a method for constructing a special rule (we call it a synergy rule) that uses as its input information the outputs (scores) of several monotonic rules which solve the same pattern recognition problem. As an example of scores of such monotonic rules we consider here scores of SVM classifiers.

In order to construct the optimal synergy rule, we estimate the conditional probability function based on the direct problem setting, which requires solving a Fredholm integral equation. Generally, solving a Fredholm equation is an ill-posed problem. However, in our model, we look for the solution of the equation in the set of monotonic and bounded functions, which makes the problem well-posed. This allows us to solve the equation accurately even with training data sets of limited size.

In order to construct a monotonic solution, we use the set of functions that belong to the Reproducing Kernel Hilbert Space (RKHS) associated with the INK-spline kernel (splines with Infinite Numbers of Knots) of degree zero. The paper provides details of the methods for finding multidimensional conditional probability in a set of monotonic functions to obtain the corresponding synergy rules. We demonstrate the effectiveness of such rules for 1) solving standard pattern recognition problems, 2) constructing multi-class classification rules, and 3) constructing a method for knowledge transfer from multiple intelligent teachers in the LUPI paradigm.

Patients with developmental disorders, such as autism spectrum disorder (ASD), present with symptoms that change with time even if the named diagnosis remains fixed. For example, language impairments may present as delayed speech in a toddler and difficulty reading in a school-age child. Characterizing these trajectories is important for early treatment. However, deriving these trajectories from observational sources is challenging: electronic health records only reflect observations of patients at irregular intervals and only record what factors are clinically relevant at the time of observation. Meanwhile, caretakers discuss daily developments and concerns on social media.

In this work, we present a fully unsupervised approach for learning disease trajectories from incomplete medical records and social media posts, including cases in which we have only a single observation of each patient. In particular, we use a dynamic topic model approach which embeds each disease trajectory as a path in $\mathbb{R}^D$. A Polya-gamma augmentation scheme is used to efficiently perform inference as well as incorporate multiple data sources. We learn disease trajectories from the electronic health records of 13,435 patients with ASD and the forum posts of 13,743 caretakers of children with ASD, deriving interesting clinical insights as well as good predictions.

`.mat` files and the `.npy` format for
numpy or plain text.

common sense knowledge reflected in the example traces to construct decision tree policies for goal-oriented factored POMDPs. More precisely, our algorithm (provably) succeeds at finding a policy for a given input goal when (1) there is a CNF that is almost always observed satisfied on the traces of the POMDP, capturing a sufficient

common sense rules. Such a CNF always exists for noisy STRIPS domains, for example. Our results thus essentially establish that the possession of a suitable

We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains.

The
approach implements this idea in the context of neural network
architectures that are trained on labeled data from the source
domain and unlabeled data from the target domain (no labeled
target-domain data is necessary). As the training progresses,
the approach promotes the emergence of features that are (i)
discriminative for the main learning task on the source domain
and (ii) indiscriminate with respect to the shift between the
domains. We show that this adaptation behaviour can be achieved
in almost any feed-forward model by augmenting it with few
standard layers and a new *gradient reversal* layer. The
resulting augmented architecture can be trained using standard
backpropagation and stochastic gradient descent, and can thus be
implemented with little effort using any of the deep learning
packages.
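The behaviour of a gradient reversal layer can be illustrated with a minimal sketch. The class name `GradientReversal` and the weight `lam` are assumptions made for this illustration and do not come from the source; the sketch only shows the idea of an identity forward pass paired with a sign-flipped (and scaled) backward pass, not the authors' implementation.

```python
class GradientReversal:
    """Identity in the forward pass; multiplies gradients by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        # lam is a hypothetical weight on the reversed gradient (an assumption).
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output):
        # The gradient coming back from the domain classifier is reversed,
        # so the feature extractor is pushed to make the domains harder to tell apart.
        return [-self.lam * g for g in grad_output]


layer = GradientReversal(lam=0.5)
features = [1.0, -2.0, 3.0]
out = layer.forward(features)

grad_from_domain_classifier = [0.2, 0.4, -1.0]
grad_to_feature_extractor = layer.backward(grad_from_domain_classifier)
```

In a deep learning package the same effect would typically be obtained by registering a custom backward rule for an identity operation.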

We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.

Q-learning (QL) is a popular reinforcement learning algorithm
that is guaranteed to converge to optimal policies in Markov
decision processes. However, QL exhibits an artifact: in
expectation, the effective rate of updating the value of an
action depends on the probability of choosing that action. In
other words, there is a tight coupling between the learning
dynamics and underlying execution policy. This coupling can
cause performance degradation in noisy *non-stationary*
environments.

Here, we introduce Repeated Update Q-learning (RUQL), a learning algorithm that resolves the undesirable artifact of Q-learning while maintaining simplicity. We analyze the similarities and differences between RUQL, QL, and the closest state-of-the-art algorithms theoretically. Our analysis shows that RUQL maintains the convergence guarantee of QL in stationary environments, while relaxing the coupling between the execution policy and the learning dynamics. Experimental results confirm the theoretical insights and show how RUQL outperforms both QL and the closest state-of-the-art algorithms in noisy non-stationary environments.
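The coupling between update rate and action probability can be made concrete with a small sketch. The abstract does not state the RUQL update rule, so the reading used here is an assumption: an action chosen with probability `prob` has its Q-learning update repeated `1/prob` times against a fixed target. The function names and the numbers are hypothetical.

```python
def q_update(q, target, alpha):
    """One standard Q-learning step of size alpha toward a fixed target."""
    return (1.0 - alpha) * q + alpha * target


def repeated_q_update(q, target, alpha, prob):
    """Closed form of applying q_update 1/prob times with the target held fixed.

    Assumed reading of 'repeated update'; not taken from the source text.
    """
    keep = (1.0 - alpha) ** (1.0 / prob)
    return keep * q + (1.0 - keep) * target


q0, target, alpha = 0.0, 1.0, 0.1

# An action chosen with probability 0.25 gets four repeats of the basic update.
q_loop = q0
for _ in range(4):
    q_loop = q_update(q_loop, target, alpha)

q_closed = repeated_q_update(q0, target, alpha, prob=0.25)
q_frequent = repeated_q_update(q0, target, alpha, prob=1.0)
```

Under this reading, a rarely chosen action takes a larger step each time it is selected, which compensates for being updated less often.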

A unified view on multi-class support vector machines (SVMs) is presented, covering most prominent variants including the one-vs-all approach and the algorithms proposed by Weston & Watkins, Crammer & Singer, Lee, Lin, & Wahba, and Liu & Yuan. The unification leads to a template for the quadratic training problems and new multi-class SVM formulations. Within our framework, we provide a comparative analysis of the various notions of multi-class margin and margin-based loss. In particular, we demonstrate limitations of the loss function considered, for instance, in the Crammer & Singer machine.

We analyze Fisher consistency of multi-class loss functions and universal consistency of the various machines. On the one hand, we give examples of SVMs that are, in a particular hyperparameter regime, universally consistent without being based on a Fisher consistent loss. These include the canonical extension of SVMs to multiple classes as proposed by Weston & Watkins and Vapnik as well as the one-vs-all approach. On the other hand, it is demonstrated that machines based on Fisher consistent loss functions can fail to identify proper decision boundaries in low-dimensional feature spaces.

We compared the performance of nine different multi-class SVMs in a thorough empirical study. Our results suggest using the Weston & Watkins SVM, which can be trained comparatively fast and gives good accuracies on benchmark functions. If training time is a major concern, the one-vs-all approach is the method of choice.

`CauseEffectPairs` that
consists of data for 100 different cause-effect pairs selected
from 37 data sets from various domains (e.g., meteorology,
biology, medicine, engineering, economy, etc.) and motivate our
decisions regarding the “ground truth” causal directions of all
pairs. We evaluate the performance of several bivariate causal
discovery methods on these real-world benchmark data and in
addition on artificially simulated data. Our empirical results
on real-world data indicate that certain methods are indeed able
to distinguish cause from effect using only purely observational
data, although more benchmark data would be needed to obtain
statistically significant conclusions. One of the best
performing methods overall is the method based on Additive Noise
Models that has originally been proposed by Hoyer et al. (2009),
which obtains an accuracy of 63 $\pm$ 10 % and an AUC of 0.74
$\pm$ 0.05 on the real-world benchmark. As the main theoretical
contribution of this work we prove the consistency of that
method.

We consider two closely related problems: planted clustering and submatrix localization. In the planted clustering problem, a random graph is generated based on an underlying cluster structure of the nodes; the task is to recover these clusters given the graph. The submatrix localization problem concerns locating hidden submatrices with elevated means inside a large real-valued random matrix. Of particular interest is the setting where the number of clusters/submatrices is allowed to grow unbounded with the problem size. These formulations cover several classical models such as planted clique, planted densest subgraph, planted partition, planted coloring, and the stochastic block model, which are widely used for studying community detection, graph clustering and bi-clustering.

For both problems, we show that the space of the model
parameters (cluster/submatrix size, edge probabilities and the
mean of the submatrices) can be partitioned into four disjoint
regions corresponding to decreasing statistical and
computational complexities: (1) the *impossible* regime,
where all algorithms fail; (2) the *hard* regime, where the
computationally expensive Maximum Likelihood Estimator (MLE)
succeeds; (3) the *easy* regime, where the polynomial-time
convexified MLE succeeds; (4) the *simple* regime, where a
local counting/thresholding procedure succeeds. Moreover, we
show that each of these algorithms provably fails in the harder
regimes.

Our results establish the minimax recovery limits, which are tight up to universal constants and hold even with a growing number of clusters/submatrices, and provide order-wise stronger performance guarantees for polynomial-time algorithms than previously known. Our study demonstrates the tradeoffs between statistical and computational considerations, and suggests that the minimax limits may not be achievable by polynomial-time algorithms.

High-dimensional datasets are well-approximated by low-dimensional structures. Over the past decade, this empirical observation motivated the investigation of detection, measurement, and modeling techniques to exploit these low-dimensional intrinsic structures, yielding numerous implications for high-dimensional statistics, machine learning, and signal processing. Manifold learning (where the low-dimensional structure is a manifold) and dictionary learning (where the low-dimensional structure is the set of sparse linear combinations of vectors from a finite dictionary) are two prominent theoretical and computational frameworks in this area. Despite their ostensible distinction, the recently-introduced Geometric Multi-Resolution Analysis (GMRA) provides a robust, computationally efficient, multiscale procedure for simultaneously learning manifolds and dictionaries.

In this work, we prove non-asymptotic probabilistic bounds on the approximation error of GMRA for a rich class of data-generating statistical models that includes “noisy” manifolds, thereby establishing the theoretical robustness of the procedure and confirming empirical observations. In particular, if a dataset aggregates near a low-dimensional manifold, our results show that the approximation error of the GMRA is completely independent of the ambient dimension. Our work therefore establishes GMRA as a provably fast algorithm for dictionary learning with approximation and sparsity guarantees. We include several numerical experiments confirming these theoretical results, and our theoretical framework provides new tools for assessing the behavior of manifold learning and dictionary learning procedures on a large class of interesting models.

`print/plot/predict` methods are available; (b) dedicated methods for trees with

We consider the problem of approximating and learning disjunctions (or equivalently, conjunctions) on symmetric distributions over $\{0,1\}^n$. Symmetric distributions are distributions whose PDF is invariant under any permutation of the variables. We prove that for every symmetric distribution $\mathcal{D}$, there exists a set of $n^{O(\log{(1/\epsilon)})}$ functions $\mathbb{S}$, such that for every disjunction $c$, there is a function $p$, expressible as a linear combination of functions in $\mathbb{S}$, such that $p$ $\epsilon$-approximates $c$ in $\ell_1$ distance on $\mathcal{D}$, that is, $\mathbf{E}_{x \sim \mathcal{D}}[ |c(x)-p(x)|] \leq \epsilon$. This implies an agnostic learning algorithm for disjunctions on symmetric distributions that runs in time $n^{O( \log{(1/\epsilon)})}$. The best known previous bound is $n^{O(1/\epsilon^4)}$ and follows from approximation of the more general class of halfspaces (Wimmer, 2010). We also show that there exists a symmetric distribution $\mathcal{D}$, such that the minimum degree of a polynomial that $1/3$-approximates the disjunction of all $n$ variables in $\ell_1$ distance on $\mathcal{D}$ is $\Omega(\sqrt{n})$. Therefore the learning result above cannot be achieved via $\ell_1$-regression with a polynomial basis used in most other agnostic learning algorithms.

Our technique also gives a simple proof that for any product distribution $\mathcal{D}$ and every disjunction $c$, there exists a polynomial $p$ of degree $O(\log{(1/\epsilon)})$ such that $p$ $\epsilon$-approximates $c$ in $\ell_1$ distance on $\mathcal{D}$. This was first proved by Blais et al. (2008) via a more involved argument.

- when the dropout-regularized criterion has a unique minimizer,
- when the dropout-regularization penalty goes to infinity with the weights, and when it remains bounded,
- that the dropout regularization can be non-monotonic as individual weights increase from 0, and
- that the dropout regularization penalty may *not* be convex.

`softImpute` in R for implementing our approaches, and a distributed version for very large matrices using the `Spark` cluster programming environment

When learning a directed acyclic graph (DAG) model via observational data, one generally cannot identify the underlying DAG, but can potentially obtain a Markov equivalence class. The size (the number of DAGs) of a Markov equivalence class is crucial to infer causal effects or to learn the exact causal DAG via further interventions. Given a set of Markov equivalence classes, the distribution of their sizes is a key consideration in developing learning methods. However, counting the size of an equivalence class with many vertices is usually computationally infeasible, and the existing literature reports the size distributions only for equivalence classes with ten or fewer vertices.

In this paper, we develop a method to compute the size of a Markov equivalence class. We first show that there are five types of Markov equivalence classes whose sizes can be formulated as five functions of the number of vertices respectively. Then we introduce a new concept of a rooted sub-class. The graph representations of rooted sub-classes of a Markov equivalence class are used to partition this class recursively until the sizes of all rooted sub-classes can be computed via the five functions. The proposed size counting is efficient for Markov equivalence classes of sparse DAGs with hundreds of vertices. Finally, we explore the size and edge distributions of Markov equivalence classes and find experimentally that, in general, (1) most Markov equivalence classes are half completed and their average sizes are small, and (2) the sizes of sparse classes grow approximately exponentially with the numbers of vertices.

Forward stagewise regression follows a very simple strategy for constructing a sequence of sparse regression estimates: it starts with all coefficients equal to zero, and iteratively updates the coefficient (by a small amount $\epsilon$) of the variable that achieves the maximal absolute inner product with the current residual. This procedure has an interesting connection to the lasso: under some conditions, it is known that the sequence of forward stagewise estimates exactly coincides with the lasso path, as the step size $\epsilon$ goes to zero. Furthermore, essentially the same equivalence holds outside of least squares regression, with the minimization of a differentiable convex loss function subject to an $\ell_1$ norm constraint (the stagewise algorithm now updates the coefficient corresponding to the maximal absolute component of the gradient).
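The update rule described above can be sketched in a few lines of plain Python. The function name `forward_stagewise`, the default step size, the iteration count, and the toy data are assumptions chosen for illustration; the sketch covers only the least squares case.

```python
def forward_stagewise(X, y, eps=0.01, steps=1000):
    """Forward stagewise regression for least squares.

    X is a list of rows, y a list of responses. All coefficients start at zero.
    """
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    residual = list(y)
    for _ in range(steps):
        # Inner product of every variable (column) with the current residual.
        corr = [sum(X[i][j] * residual[i] for i in range(n)) for j in range(d)]
        # Pick the variable with the largest absolute inner product.
        j = max(range(d), key=lambda k: abs(corr[k]))
        if corr[j] == 0.0:
            break  # residual is orthogonal to every column
        # Move that coefficient by eps in the direction of the inner product.
        delta = eps if corr[j] > 0 else -eps
        beta[j] += delta
        for i in range(n):
            residual[i] -= delta * X[i][j]
    return beta


# Toy design with two orthogonal columns; only the first one explains y.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.0, 2.0, 0.0]
beta = forward_stagewise(X, y, eps=0.01, steps=300)
```

With a fixed step size the estimate ends up within about one step `eps` of the least squares coefficient and then oscillates around it, which is why small values of `eps` are of interest.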

Even when they do not match their $\ell_1$-constrained analogues, stagewise estimates provide a useful approximation, and are computationally appealing. Their success in sparse modeling motivates the question: can a simple, effective strategy like forward stagewise be applied more broadly in other regularization settings, beyond the $\ell_1$ norm and sparsity? The current paper is an attempt to do just this. We present a general framework for stagewise estimation, which yields fast algorithms for problems such as group-structured learning, matrix completion, image denoising, and more.

We consider the query and computational complexity of learning multiplicity tree automata in Angluin's exact learning model. In this model, there is an oracle, called the Teacher, that can answer membership and equivalence queries posed by the Learner. Motivated by this feature, we first characterise the complexity of the equivalence problem for multiplicity tree automata, showing that it is logspace equivalent to polynomial identity testing.

We then move to query complexity, deriving lower bounds on the number of queries needed to learn multiplicity tree automata over both fixed and arbitrary fields. In the latter case, the bound is linear in the size of the target automaton. The best known upper bound on the query complexity over arbitrary fields derives from an algorithm of Habrard and Oncina (2006), in which the number of queries is proportional to the size of the target automaton and the size of a largest counterexample, represented as a tree, that is returned by the Teacher. However, a smallest counterexample tree may already be exponential in the size of the target automaton. Thus the above algorithm has query complexity exponentially larger than our lower bound, and does not run in time polynomial in the size of the target automaton. We give a new learning algorithm for multiplicity tree automata in which counterexamples to equivalence queries are represented as DAGs. The query complexity of this algorithm is quadratic in the target automaton size and linear in the size of a largest counterexample. In particular, if the Teacher always returns DAG counterexamples of minimal size then the query complexity is quadratic in the target automaton size---almost matching the lower bound, and improving the best previously-known algorithm by an exponential factor.

Motivated by problems in insurance, our task is to predict finite upper bounds on a future draw from an unknown distribution $p$ over natural numbers. We can only use past observations generated independently and identically distributed according to $p$. While $p$ is unknown, it is known to belong to a given collection $\mathcal{P}$ of probability distributions on the natural numbers.

The support of the distributions $p
\in \mathcal{P}$ may be unbounded, and the prediction game goes
on for *infinitely* many draws. We are allowed to make
observations without predicting upper bounds for some time. But
we must, with probability $1$, start and then continue to
predict upper bounds after a finite time irrespective of which
$p \in \mathcal{P}$ governs the data.

If it is possible,
without knowledge of $p$ and for any prescribed confidence
however close to $1$, to come up with a sequence of upper bounds
that is never violated over an infinite time window with
confidence at least as big as prescribed, we say the model class
$\mathcal{P}$ is *insurable*. We completely characterize
the insurability of any class $\mathcal{P}$ of distributions
over natural numbers by means of a condition on how the
neighborhoods of distributions in $\mathcal{P}$ should be, one
that is both necessary and sufficient.

`camel` implementing the proposed method is available on the Comprehensive R Archive Network cran.r-project.org/web/packages/camel.

Stochastic multiplicity automata (SMA) are weighted finite automata that generalize probabilistic automata. They have been used in the context of probabilistic grammatical inference. Observable operator models (OOMs) are a generalization of hidden Markov models, which in turn are models for discrete-valued stochastic processes and are used ubiquitously in the context of speech recognition and bio-sequence modeling. Predictive state representations (PSRs) extend OOMs to stochastic input-output systems and are employed in the context of agent modeling and planning.

We present SMA, OOMs, and PSRs under the common framework of sequential systems, which are an algebraic characterization of multiplicity automata, and examine the precise relationships between them. Furthermore, we establish a unified approach to learning such models from data. Many of the learning algorithms that have been proposed can be understood as variations of this basic learning scheme, and several turn out to be closely related to each other, or even equivalent.

In many applications, one has side information, *e.g.*,
labels that are provided in a semi-supervised manner, about a
specific target region of a large data set, and one wants to
perform machine learning and data analysis tasks "nearby" that
prespecified target region. For example, one might be interested
in the clustering structure of a data graph near a prespecified
"seed set" of nodes, or one might be interested in finding
partitions in an image that are near a prespecified "ground
truth" set of pixels. Locally-biased problems of this sort are
particularly challenging for popular eigenvector-based machine
learning and data analysis tools. At root, the reason is that
eigenvectors are inherently global quantities, thus limiting the
applicability of eigenvector-based methods in situations where
one is interested in very local properties of the data.

In this paper, we address this issue by providing a
methodology to construct *semi-supervised eigenvectors* of
a graph Laplacian, and we illustrate how these locally-biased
eigenvectors can be used to perform *locally-biased machine
learning*. These semi-supervised eigenvectors capture
successively-orthogonalized directions of maximum variance,
conditioned on being well-correlated with an input seed set of
nodes that is assumed to be provided in a semi-supervised
manner. We show that these semi-supervised eigenvectors can be
computed quickly as the solution to a system of linear
equations; and we also describe several variants of our basic
method that have improved scaling properties. We provide several
empirical examples demonstrating how these semi-supervised
eigenvectors can be used to perform locally-biased learning; and
we discuss the relationship between our results and recent
machine learning algorithms that use global eigenvectors of the
graph Laplacian.

Fitting high-dimensional statistical models often requires
the use of non-linear parameter estimation procedures. As a
consequence, it is generally impossible to obtain an exact
characterization of the probability distribution of the
parameter estimates. This in turn implies that it is extremely
challenging to quantify the *uncertainty* associated with a
certain parameter estimate. Concretely, no commonly accepted
procedure exists for computing classical measures of uncertainty
and statistical significance, such as confidence intervals or
$p$-values for these models.

We consider here the high-dimensional linear regression problem, and propose an efficient algorithm for constructing confidence intervals and $p$-values. The resulting confidence intervals have nearly optimal size. When testing for the null hypothesis that a certain parameter is vanishing, our method has nearly optimal power.

Our approach is based on constructing a 'de-biased' version of regularized M-estimators. The new construction improves over recent work in the field in that it does not assume a special structure on the design matrix. We test our method on synthetic data and a high-throughput genomic data set about riboflavin production rate, made publicly available by Bühlmann et al. (2014).

In a multi-armed bandit (MAB) problem, an online algorithm makes a sequence of choices. In each round it chooses from a time-invariant set of alternatives and receives the payoff associated with this alternative. While the case of small strategy sets is by now well-understood, a lot of recent work has focused on MAB problems with exponentially or infinitely large strategy sets, where one needs to assume extra structure in order to make the problem tractable. In particular, recent literature considered information on similarity between arms.

We consider similarity information in the setting
of *contextual bandits*, a natural extension of the basic
MAB problem where before each round an algorithm is given the
*context*---a hint about the payoffs in this round.
Contextual bandits are directly motivated by placing
advertisements on web pages, one of the crucial problems in
sponsored search. A particularly simple way to represent
similarity information in the contextual bandit setting is via a
*similarity distance* between the context-arm pairs which
bounds from above the difference between the respective expected
payoffs.
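In symbols, and using notation assumed here rather than taken from the source (contexts $x, x'$, arms $y, y'$, expected payoff $\mu(x,y)$, similarity distance $\mathcal{D}$), this condition can be written as:

$$
|\mu(x,y) - \mu(x',y')| \;\le\; \mathcal{D}\bigl((x,y),(x',y')\bigr) \quad \text{for all context-arm pairs } (x,y) \text{ and } (x',y').
$$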

Prior work on contextual bandits with similarity
uses "uniform" partitions of the similarity space, so that each
context-arm pair is approximated by the closest pair in the
partition. Algorithms based on "uniform" partitions disregard
the structure of the payoffs and the context arrivals, which is
potentially wasteful. We present algorithms that are based on
*adaptive* partitions, and take advantage of "benign"
payoffs and context arrivals without sacrificing the worst-case
performance. The central idea is to maintain a finer partition
in high-payoff regions of the similarity space and in popular
regions of the context space. Our results apply to several other
settings, e.g., MAB with constrained temporal change (Slivkins
and Upfal, 2008) and sleeping bandits (Kleinberg et al.,
2008a).

We give novel algorithms for stochastic strongly-convex optimization in the gradient oracle model which return a $O(\frac{1}{T})$-approximate solution after $T$ iterations. The first algorithm is deterministic, and achieves this rate via gradient updates and historical averaging. The second algorithm is randomized, and is based on pure gradient steps with a random step size.

This rate of convergence is optimal in the gradient oracle model. This improves upon the previously known best rate of $O(\frac{\log(T)}{T})$, which was obtained by applying an online strongly-convex optimization algorithm with regret $O(\log(T))$ to the batch setting.

We complement this result by proving that any algorithm has expected regret of $\Omega(\log(T))$ in the online stochastic strongly-convex optimization setting. This shows that any online-to-batch conversion is inherently suboptimal for stochastic strongly-convex optimization. This is the first formal evidence that online convex optimization is strictly more difficult than batch stochastic convex optimization.
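To make the gradient oracle setting concrete, here is a minimal sketch of stochastic gradient descent on a strongly convex quadratic. It is not either of the algorithms proposed in the paper: the step size $1/(\lambda t)$, the averaging over the second half of the iterates, the function names, and the noise model are all assumptions chosen to give a small runnable illustration.

```python
import random


def sgd_strongly_convex(grad_oracle, x0, lam, T):
    """Run T noisy gradient steps with step size 1 / (lam * t).

    Returns the last iterate and the average of the second half of the iterates.
    """
    x = x0
    tail = []
    for t in range(1, T + 1):
        g = grad_oracle(x)      # one call to the stochastic gradient oracle
        x = x - g / (lam * t)   # step size shrinks as 1 / (lam * t)
        if t > T // 2:
            tail.append(x)
    return x, sum(tail) / len(tail)


rng = random.Random(0)
lam, c = 2.0, 3.0  # f(x) = (lam / 2) * (x - c)^2 is lam-strongly convex


def noisy_grad(x):
    # Exact gradient lam * (x - c) plus bounded zero-mean noise.
    return lam * (x - c) + rng.uniform(-1.0, 1.0)


last, avg = sgd_strongly_convex(noisy_grad, x0=0.0, lam=lam, T=10000)
```

For this particular quadratic the last iterate equals `c` minus the running mean of the noise divided by `lam`, so both returned values end up close to the minimizer `c`.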

Optimization on manifolds is a rapidly developing branch of nonlinear optimization. Its focus is on problems where the smooth geometry of the search space can be leveraged to design efficient numerical algorithms. In particular, optimization on manifolds is well-suited to deal with rank and orthogonality constraints. Such structured constraints appear pervasively in machine learning applications, including low-rank matrix completion, sensor network localization, camera network registration, independent component analysis, metric learning, dimensionality reduction and so on.

The Manopt toolbox, available at www.manopt.org, is a user-friendly, documented piece of software dedicated to simplifying experimentation with state-of-the-art Riemannian optimization algorithms. By dealing internally with most of the differential geometry, the package aims particularly at lowering the entrance barrier.