JMLR

http://www.jmlr.org JMLR Journal of Machine Learning Research More PAC-Bayes bounds: From bounded losses, to losses with general tail behaviors, to anytime validity http://jmlr.org/papers/v25/23-1360.html http://jmlr.org/papers/volume25/23-1360/23-1360.pdf 2024 Borja Rodríguez-Gálvez, Ragnar Thobaben, Mikael Skoglund In this paper, we present new high-probability PAC-Bayes bounds for different types of losses. Firstly, for losses with a bounded range, we recover a strengthened version of Catoni's bound that holds uniformly for all parameter values. This leads to new fast-rate and mixed-rate bounds that are interpretable and tighter than previous bounds in the literature. In particular, the fast-rate bound is equivalent to the Seeger--Langford bound. Secondly, for losses with more general tail behaviors, we introduce two new parameter-free bounds: a PAC-Bayes Chernoff analogue when the loss' cumulative generating function is bounded, and a bound when the loss' second moment is bounded. These two bounds are obtained using a new technique based on a discretization of the space of possible events for the "in probability" parameter optimization problem. This technique is both simpler and more general than previous approaches optimizing over a grid on the parameters' space. Finally, using a simple technique that is applicable to any existing bound, we extend all previous results to anytime-valid bounds. Neural Hilbert Ladders: Multi-Layer Neural Networks in Function Space http://jmlr.org/papers/v25/23-1225.html http://jmlr.org/papers/volume25/23-1225/23-1225.pdf 2024 Zhengdao Chen To characterize the function space explored by neural networks (NNs) is an important aspect of learning theory. In this work, noticing that a multi-layer NN generates implicitly a hierarchy of reproducing kernel Hilbert spaces (RKHSs) -named a neural Hilbert ladder (NHL) - we define the function space as an infinite union of RKHSs, which generalizes the existing Barron space theory of two-layer NNs. We then establish several theoretical properties of the new space. First, we prove a correspondence between functions expressed by L-layer NNs and those belonging to L-level NHLs. Second, we prove generalization guarantees for learning an NHL with a controlled complexity measure. Third, we derive a non-Markovian dynamics of random fields that governs the evolution of the NHL which is induced by the training of multi-layer NNs in an infinite-width mean-field limit. Fourth, we show examples of depth separation in NHLs under the ReLU activation function. Finally, we perform numerical experiments to illustrate the feature learning aspect of NN training through the lens of NHLs. QDax: A Library for Quality-Diversity and Population-based Algorithms with Hardware Acceleration http://jmlr.org/papers/v25/23-1027.html http://jmlr.org/papers/volume25/23-1027/23-1027.pdf 2024 Felix Chalumeau, Bryan Lim, Raphaël Boige, Maxime Allard, Luca Grillotti, Manon Flageat, Valentin Macé, Guillaume Richard, Arthur Flajolet, Thomas Pierrot, Antoine Cully QDax is an open-source library with a streamlined and modular API for Quality-Diversity (QD) optimisation algorithms in Jax. The library serves as a versatile tool for optimisation purposes, ranging from black-box optimisation to continuous control. QDax offers implementations of popular QD, Neuroevolution, and Reinforcement Learning (RL) algorithms, supported by various examples. All the implementations can be just-in-time compiled with Jax, facilitating efficient execution across multiple accelerators, including GPUs and TPUs. These implementations effectively demonstrate the framework's flexibility and user-friendliness, easing experimentation for research purposes. Furthermore, the library is thoroughly documented and has 93% test coverage. Random Forest Weighted Local Fr{{\'e}}chet Regression with Random Objects http://jmlr.org/papers/v25/23-0811.html http://jmlr.org/papers/volume25/23-0811/23-0811.pdf 2024 Rui Qiu, Zhou Yu, Ruoqing Zhu Statistical analysis is increasingly confronted with complex data from metric spaces. Petersen and Müller (2019) established a general paradigm of Fréchet regression with complex metric space valued responses and Euclidean predictors. However, the local approach therein involves nonparametric kernel smoothing and suffers from the curse of dimensionality. To address this issue, we in this paper propose a novel random forest weighted local Fréchet regression paradigm. The main mechanism of our approach relies on a locally adaptive kernel generated by random forests. Our first method uses these weights as the local average to solve the conditional Fréchet mean, while the second method performs local linear Fréchet regression, both significantly improving existing Fréchet regression methods. Based on the theory of infinite order U-processes and infinite order $M_{m_n}$-estimator, we establish the consistency, rate of convergence, and asymptotic normality for our local constant estimator, which covers the current large sample theory of random forests with Euclidean responses as a special case. Numerical studies show the superiority of our methods with several commonly encountered types of responses such as distribution functions, symmetric positive-definite matrices, and sphere data. The practical merits of our proposals are also demonstrated through the application to New York taxi data and human mortality data. PhAST: Physics-Aware, Scalable, and Task-Specific GNNs for Accelerated Catalyst Design http://jmlr.org/papers/v25/23-0680.html http://jmlr.org/papers/volume25/23-0680/23-0680.pdf 2024 Alexandre Duval, Victor Schmidt, Santiago Miret, Yoshua Bengio, Alex Hernández-García, David Rolnick Mitigating the climate crisis requires a rapid transition towards lower-carbon energy. Catalyst materials play a crucial role in the electrochemical reactions involved in numerous industrial processes key to this transition, such as renewable energy storage and electrofuel synthesis. To reduce the energy spent on such activities, we must quickly discover more efficient catalysts to drive electrochemical reactions. Machine learning (ML) holds the potential to efficiently model materials properties from large amounts of data, accelerating electrocatalyst design. The Open Catalyst Project OC20 dataset was constructed to that end. However, ML models trained on OC20 are still neither scalable nor accurate enough for practical applications. In this paper, we propose task-specific innovations applicable to most architectures, enhancing both computational efficiency and accuracy. This includes improvements in (1) the graph creation step, (2) atom representations, (3) the energy prediction head, and (4) the force prediction head. We describe these contributions, referred to as PhAST, and evaluate them thoroughly on multiple architectures. Overall, PhAST improves energy MAE by 4 to 42% while dividing compute time by 3 to 8× depending on the targeted task/model. PhAST also enables CPU training, leading to 40× speedups in highly parallelized settings. Python package: https://phast.readthedocs.io. Unsupervised Anomaly Detection Algorithms on Real-world Data: How Many Do We Need? http://jmlr.org/papers/v25/23-0570.html http://jmlr.org/papers/volume25/23-0570/23-0570.pdf 2024 Roel Bouman, Zaharah Bukhsh, Tom Heskes In this study we evaluate 33 unsupervised anomaly detection algorithms on 52 real-world multivariate tabular data sets, performing the largest comparison of unsupervised anomaly detection algorithms to date. On this collection of data sets, the EIF (Extended Isolation Forest) algorithm significantly outperforms the most other algorithms. Visualizing and then clustering the relative performance of the considered algorithms on all data sets, we identify two clear clusters: one with "local” data sets, and another with "global” data sets. "Local” anomalies occupy a region with low density when compared to nearby samples, while "global” occupy an overall low density region in the feature space. On the local data sets the $k$NN ($k$-nearest neighbor) algorithm comes out on top. On the global data sets, the EIF (extended isolation forest) algorithm performs the best. Also taking into consideration the algorithms' computational complexity, a toolbox with these two unsupervised anomaly detection algorithms suffices for finding anomalies in this representative collection of multivariate data sets. By providing access to code and data sets, our study can be easily reproduced and extended with more algorithms and/or data sets. Multi-class Probabilistic Bounds for Majority Vote Classifiers with Partially Labeled Data http://jmlr.org/papers/v25/23-0121.html http://jmlr.org/papers/volume25/23-0121/23-0121.pdf 2024 Vasilii Feofanov, Emilie Devijver, Massih-Reza Amini In this paper, we propose a probabilistic framework for analyzing a multi-class majority vote classifier in the case where training data is partially labeled. First, we derive a multi-class transductive bound over the risk of the majority vote classifier, which is based on the classifier's vote distribution over each class. Then, we introduce a mislabeling error model to analyze the error of the majority vote classifier in the case of the pseudo-labeled training data. We derive a generalization bound over the majority vote error when imperfect labels are given, taking into account the mean and the variance of the prediction margin. Finally, we demonstrate an application of the derived transductive bound for self-training to find automatically the confidence threshold used to determine unlabeled examples for pseudo-labeling. Empirical results on different data sets show the effectiveness of our framework compared to several state-of-the-art semi-supervised approaches. Information Processing Equalities and the Information–Risk Bridge http://jmlr.org/papers/v25/22-0988.html http://jmlr.org/papers/volume25/22-0988/22-0988.pdf 2024 Robert C. Williamson, Zac Cranko We introduce two new classes of measures of information for statistical experiments which generalise and subsume φ-divergences, integral probability metrics, N-distances (MMD), and (f,Γ) divergences between two or more distributions. This enables us to derive a simple geometrical relationship between measures of information and the Bayes risk of a statistical decision problem, thus extending the variational φ-divergence representation to multiple distributions in an entirely symmetric manner. The new families of divergence are closed under the action of Markov operators which yields an information processing equality which is a refinement and generalisation of the classical information processing inequality. This equality gives insight into the significance of the choice of the hypothesis class in classical risk minimization. Nonparametric Regression for 3D Point Cloud Learning http://jmlr.org/papers/v25/22-0735.html http://jmlr.org/papers/volume25/22-0735/22-0735.pdf 2024 Xinyi Li, Shan Yu, Yueying Wang, Guannan Wang, Li Wang, Ming-Jun Lai In recent years, there has been an exponentially increased amount of point clouds collected with irregular shapes in various areas. Motivated by the importance of solid modeling for point clouds, we develop a novel and efficient smoothing tool based on multivariate splines over the triangulation to extract the underlying signal and build up a 3D solid model from the point cloud. The proposed method can denoise or deblur the point cloud effectively, provide a multi-resolution reconstruction of the actual signal, and handle sparse and irregularly distributed point clouds to recover the underlying trajectory. In addition, our method provides a natural way of numerosity data reduction. We establish the theoretical guarantees of the proposed method, including the convergence rate and asymptotic normality of the estimator, and show that the convergence rate achieves optimal nonparametric convergence. We also introduce a bootstrap method to quantify the uncertainty of the estimators. Through extensive simulation studies and a real data example, we demonstrate the superiority of the proposed method over traditional smoothing methods in terms of estimation accuracy and efficiency of data reduction. AMLB: an AutoML Benchmark http://jmlr.org/papers/v25/22-0493.html http://jmlr.org/papers/volume25/22-0493/22-0493.pdf 2024 Pieter Gijsbers, Marcos L. P. Bueno, Stefan Coors, Erin LeDell, Sébastien Poirier, Janek Thomas, Bernd Bischl, Joaquin Vanschoren Comparing different AutoML frameworks is notoriously challenging and often done incorrectly. We introduce an open and extensible benchmark that follows best practices and avoids common mistakes when comparing AutoML frameworks. We conduct a thorough comparison of 9 well-known AutoML frameworks across 71 classification and 33 regression tasks. The differences between the AutoML frameworks are explored with a multi-faceted analysis, evaluating model accuracy, its trade-offs with inference time, and framework failures. We also use Bradley-Terry trees to discover subsets of tasks where the relative AutoML framework rankings differ. The benchmark comes with an open-source tool that integrates with many AutoML frameworks and automates the empirical evaluation process end-to-end: from framework installation and resource allocation to in-depth evaluation. The benchmark uses public data sets, can be easily extended with other AutoML frameworks and tasks, and has a website with up-to-date results. Materials Discovery using Max K-Armed Bandit http://jmlr.org/papers/v25/22-0186.html http://jmlr.org/papers/volume25/22-0186/22-0186.pdf 2024 Nobuaki Kikkawa, Hiroshi Ohno Search algorithms for bandit problems are applicable in materials discovery. However, objectives of the conventional bandit problem are different from those of materials discovery. The conventional bandit problem aims to maximize the total rewards, whereas materials discovery aims to achieve breakthroughs in material properties. The max $K$-armed bandit (MKB) problem, which aims to acquire the single best reward, matches with the discovery tasks better than the conventional bandit. However, typical MKB algorithms are not directly applicable to materials discovery due to some difficulties. The typical algorithms have many hyperparameters and some difficulty in the directly implementation for the materials discovery. Thus, we propose a new MKB algorithm using an upper confidence bound of expected improvement of the best reward. This approach is guaranteed to be asymptotic to greedy oracles, which does not depend on the time horizon. In addition, compared with other MKB algorithms, the proposed algorithm has only one hyperparameter, which is advantageous in materials discovery. We applied the proposed algorithm to synthetic problems and molecular-design demonstrations using a Monte Carlo tree search. According to the results, the proposed algorithm stably outperformed other bandit algorithms in the late stage of the search process, unless the optimal arm coincides in the MKB and conventional bandit settings. Semi-supervised Inference for Block-wise Missing Data without Imputation http://jmlr.org/papers/v25/21-1504.html http://jmlr.org/papers/volume25/21-1504/21-1504.pdf 2024 Shanshan Song, Yuanyuan Lin, Yong Zhou We consider statistical inference for single or low-dimensional parameters in a high-dimensional linear model under a semi-supervised setting, wherein the data are a combination of a labelled block-wise missing data set of a relatively small size and a large unlabelled data set. The proposed method utilises both labelled and unlabelled data without any imputation or removal of the missing observations. The asymptotic properties of the estimator are established under regularity conditions. Hypothesis testing for low-dimensional coefficients are also studied. Extensive simulations are conducted to examine the theoretical results. The method is evaluated on the Alzheimer’s Disease Neuroimaging Initiative data. Adaptivity and Non-stationarity: Problem-dependent Dynamic Regret for Online Convex Optimization http://jmlr.org/papers/v25/21-0748.html http://jmlr.org/papers/volume25/21-0748/21-0748.pdf 2024 Peng Zhao, Yu-Jie Zhang, Lijun Zhang, Zhi-Hua Zhou We investigate online convex optimization in non-stationary environments and choose dynamic regret as the performance measure, defined as the difference between cumulative loss incurred by the online algorithm and that of any feasible comparator sequence. Let $T$ be the time horizon and $P_T$ be the path length that essentially reflects the non-stationarity of environments, the state-of-the-art dynamic regret is $\mathcal{O}(\sqrt{T(1+P_T)})$. Although this bound is proved to be minimax optimal for convex functions, in this paper, we demonstrate that it is possible to further enhance the guarantee for some easy problem instances, particularly when online functions are smooth. Specifically, we introduce novel online algorithms that can exploit smoothness and replace the dependence on $T$ in dynamic regret with problem-dependent quantities: the variation in gradients of loss functions, the cumulative loss of the comparator sequence, and the minimum of these two terms. These quantities are at most $\mathcal{O}(T)$ while could be much smaller in benign environments. Therefore, our results are adaptive to the intrinsic difficulty of the problem, since the bounds are tighter than existing results for easy problems and meanwhile safeguard the same rate in the worst case. Notably, our proposed algorithms can achieve favorable dynamic regret with only one gradient per iteration, sharing the same gradient query complexity as the static regret minimization methods. To accomplish this, we introduce the collaborative online ensemble framework. The proposed framework employs a two-layer online ensemble to handle non-stationarity, and uses optimistic online learning and further introduces crucial correction terms to enable effective collaboration within the meta-base two layers, thereby attaining adaptivity. We believe the framework can be useful for broader problems. Scaling Speech Technology to 1,000+ Languages http://jmlr.org/papers/v25/23-1318.html http://jmlr.org/papers/volume25/23-1318/23-1318.pdf 2024 Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task while providing improved accuracy compared to prior work. The main ingredients are a new dataset based on readings of publicly available religious texts and effectively leveraging self-supervised learning. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, as well as a language identification model for 4,017 languages. Experiments show that our multilingual speech recognition model more than halves the word error rate of Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data. MAP- and MLE-Based Teaching http://jmlr.org/papers/v25/23-1086.html http://jmlr.org/papers/volume25/23-1086/23-1086.pdf 2024 Hans Ulrich Simon, Jan Arne Telle Imagine a learner $L$ who tries to infer a hidden concept from a collection of observations. Building on the work of Ferri et al we assume the learner to be parameterized by priors $P(c)$ and by $c$-conditional likelihoods $P(z|c)$ where $c$ ranges over all concepts in a given class $C$ and $z$ ranges over all observations in an observation set $Z$. $L$ is called a MAP-learner (resp.~an MLE-learner) if it thinks of a collection $S$ of observations as a random sample and returns the concept with the maximum a-posteriori probability (resp.~the concept which maximizes the $c$-conditional likelihood of $S$). Depending on whether $L$ assumes that $S$ is obtained from ordered or unordered sampling resp.~from sampling with or without replacement, we can distinguish four different sampling modes. Given a target concept $c^* \in C$, a teacher for a MAP-learner $L$ aims at finding a smallest collection of observations that causes $L$ to return $c^*$. This approach leads in a natural manner to various notions of a MAP- or MLE-teaching dimension of a concept class $C$. Our main results are as follows. First, we show that this teaching model has some desirable monotonicity properties. Second we clarify how the four sampling modes are related to each other. As for the (important!) special case, where concepts are subsets of a domain and observations are 0,1-labeled examples, we obtain some additional results. First of all, we characterize the MAP- and MLE-teaching dimension associated with an optimally parameterized MAP-learner graph-theoretically. From this central result, some other ones are easy to derive. It is shown, for instance, that the MLE-teaching dimension is either equal to the MAP-teaching dimension or exceeds the latter by $1$. It is shown furthermore that these dimensions can be bounded from above by the so-called antichain number, the VC-dimension and related combinatorial parameters. Moreover they can be computed in polynomial time. A General Framework for the Analysis of Kernel-based Tests http://jmlr.org/papers/v25/23-0985.html http://jmlr.org/papers/volume25/23-0985/23-0985.pdf 2024 Tamara Fernández, Nicolás Rivera Kernel-based tests provide a simple yet effective framework that uses the theory of reproducing kernel Hilbert spaces to design non-parametric testing procedures. In this paper, we propose new theoretical tools that can be used to study the asymptotic behaviour of kernel-based tests in various data scenarios and in different testing problems. Unlike current approaches, our methods avoid working with U and V-statistics expansions that usually lead to lengthy and tedious computations and asymptotic approximations. Instead, we work directly with random functionals on the Hilbert space to analyse kernel-based tests. By harnessing the use of random functionals, our framework leads to much cleaner analyses, involving less tedious computations. Additionally, it offers the advantage of accommodating pre-existing knowledge regarding test-statistics as many of the random functionals considered in applications are known statistics that have been studied comprehensively. To demonstrate the efficacy of our approach, we thoroughly examine two categories of kernel tests, along with three specific examples of kernel tests, including a novel kernel test for conditional independence testing. Overparametrized Multi-layer Neural Networks: Uniform Concentration of Neural Tangent Kernel and Convergence of Stochastic Gradient Descent http://jmlr.org/papers/v25/23-0740.html http://jmlr.org/papers/volume25/23-0740/23-0740.pdf 2024 Jiaming Xu, Hanjing Zhu There have been exciting progresses in understanding the convergence of gradient descent (GD) and stochastic gradient descent (SGD) in overparameterized neural networks through the lens of neural tangent kernel (NTK). However, there remain two significant gaps between theory and practice. First, the existing convergence theory only takes into account the contribution of the NTK from the last hidden layer, while in practice the intermediate layers also play an instrumental role. Second, most existing works assume that the training data are provided a priori in a batch, while less attention has been paid to the important setting where the training data arrive in a stream. In this paper, we close these two gaps. We first show that with random initialization, the NTK function converges to some deterministic function uniformly for all layers as the number of neurons tends to infinity. Then we apply the uniform convergence result to further prove that the prediction error of multi-layer neural networks under SGD converges in expectation in the streaming data setting. A key ingredient in our proof is to show the number of activation patterns of an $L$-layer neural network with width $m$ is only polynomial in $m$ although there are $mL$ neurons in total. Sparse Representer Theorems for Learning in Reproducing Kernel Banach Spaces http://jmlr.org/papers/v25/23-0645.html http://jmlr.org/papers/volume25/23-0645/23-0645.pdf 2024 Rui Wang, Yuesheng Xu, Mingsong Yan Sparsity of a learning solution is a desirable feature in machine learning. Certain reproducing kernel Banach spaces (RKBSs) are appropriate hypothesis spaces for sparse learning methods. The goal of this paper is to understand what kind of RKBSs can promote sparsity for learning solutions. We consider two typical learning models in an RKBS: the minimum norm interpolation (MNI) problem and the regularization problem. We first establish an explicit representer theorem for solutions of these problems, which represents the extreme points of the solution set by a linear combination of the extreme points of the subdifferential set, of the norm function, which is data-dependent. We then propose sufficient conditions on the RKBS that can transform the explicit representation of the solutions to a sparse kernel representation having fewer terms than the number of the observed data. Under the proposed sufficient conditions, we investigate the role of the regularization parameter on sparsity of the regularized solutions. We further show that two specific RKBSs, the sequence space $\ell_1(\mathbb{N})$ and the measure space, can have sparse representer theorems for both MNI and regularization models. Exploration of the Search Space of Gaussian Graphical Models for Paired Data http://jmlr.org/papers/v25/23-0295.html http://jmlr.org/papers/volume25/23-0295/23-0295.pdf 2024 Alberto Roverato, Dung Ngoc Nguyen We consider the problem of learning a Gaussian graphical model in the case where the observations come from two dependent groups sharing the same variables. We focus on a family of coloured Gaussian graphical models specifically suited for the paired data problem. Commonly, graphical models are ordered by the submodel relationship so that the search space is a lattice, called the model inclusion lattice. We introduce a novel order between models, named the twin order. We show that, embedded with this order, the model space is a lattice that, unlike the model inclusion lattice, is distributive. Furthermore, we provide the relevant rules for the computation of the neighbours of a model. The latter are more efficient than the same operations in the model inclusion lattice, and are then exploited to achieve a more efficient exploration of the search space. These results can be applied to improve the efficiency of both greedy and Bayesian model search procedures. Here, we implement a stepwise backward elimination procedure and evaluate its performance both on synthetic and real-world data. The good, the bad and the ugly sides of data augmentation: An implicit spectral regularization perspective http://jmlr.org/papers/v25/22-1312.html http://jmlr.org/papers/volume25/22-1312/22-1312.pdf 2024 Chi-Heng Lin, Chiraag Kaushik, Eva L. Dyer, Vidya Muthukumar Data augmentation (DA) is a powerful workhorse for bolstering performance in modern machine learning. Specific augmentations like translations and scaling in computer vision are traditionally believed to improve generalization by generating new (artificial) data from the same distribution. However, this traditional viewpoint does not explain the success of prevalent augmentations in modern machine learning (e.g. randomized masking, cutout, mixup), that greatly alter the training data distribution. In this work, we develop a new theoretical framework to characterize the impact of a general class of DA on underparameterized and overparameterized linear model generalization. Our framework reveals that DA induces implicit spectral regularization through a combination of two distinct effects: a) manipulating the relative proportion of eigenvalues of the data covariance matrix in a training-data-dependent manner, and b) uniformly boosting the entire spectrum of the data covariance matrix through ridge regression. These effects, when applied to popular augmentations, give rise to a wide variety of phenomena, including discrepancies in generalization between overparameterized and underparameterized regimes and differences between regression and classification tasks. Our framework highlights the nuanced and sometimes surprising impacts of DA on generalization, and serves as a testbed for novel augmentation design. Stochastic Approximation with Decision-Dependent Distributions: Asymptotic Normality and Optimality http://jmlr.org/papers/v25/22-0832.html http://jmlr.org/papers/volume25/22-0832/22-0832.pdf 2024 Joshua Cutler, Mateo Díaz, Dmitriy Drusvyatskiy We analyze a stochastic approximation algorithm for decision-dependent problems, wherein the data distribution used by the algorithm evolves along the iterate sequence. The primary examples of such problems appear in performative prediction and its multiplayer extensions. We show that under mild assumptions, the deviation between the average iterate of the algorithm and the solution is asymptotically normal, with a covariance that clearly decouples the effects of the gradient noise and the distributional shift. Moreover, building on the work of Hájek and Le Cam, we show that the asymptotic performance of the algorithm with averaging is locally minimax optimal. Minimax Rates for High-Dimensional Random Tessellation Forests http://jmlr.org/papers/v25/22-0673.html http://jmlr.org/papers/volume25/22-0673/22-0673.pdf 2024 Eliza O'Reilly, Ngoc Mai Tran Random forests are a popular class of algorithms used for regression and classification. The algorithm introduced by Breiman in 2001 and many of its variants are ensembles of randomized decision trees built from axis-aligned partitions of the feature space. One such variant, called Mondrian forests, was proposed to handle the online setting and is the first class of random forests for which minimax optimal rates were obtained in arbitrary dimension. However, the restriction to axis-aligned splits fails to capture dependencies between features, and random forests that use oblique splits have shown improved empirical performance for many tasks. This work shows that a large class of random forests with general split directions also achieve minimax optimal rates in arbitrary dimension. This class includes STIT forests, a generalization of Mondrian forests to arbitrary split directions, and random forests derived from Poisson hyperplane tessellations. These are the first results showing that random forest variants with oblique splits can obtain minimax optimality in arbitrary dimension. Our proof technique relies on the novel application of the theory of stationary random tessellations in stochastic geometry to statistical learning theory. Nonparametric Estimation of Non-Crossing Quantile Regression Process with Deep ReQU Neural Networks http://jmlr.org/papers/v25/22-0488.html http://jmlr.org/papers/volume25/22-0488/22-0488.pdf 2024 Guohao Shen, Yuling Jiao, Yuanyuan Lin, Joel L. Horowitz, Jian Huang We propose a penalized nonparametric approach to estimating the quantile regression process (QRP) in a nonseparable model using rectifier quadratic unit (ReQU) activated deep neural networks and introduce a novel penalty function to enforce non-crossing of quantile regression curves. We establish the non-asymptotic excess risk bounds for the estimated QRP and derive the mean integrated squared error for the estimated QRP under mild smoothness and regularity conditions. To establish these non-asymptotic risk and estimation error bounds, we also develop a new error bound for approximating $C^s$ smooth functions with $s >1$ and their derivatives using ReQU activated neural networks. This is a new approximation result for ReQU networks and is of independent interest and may be useful in other problems. Our numerical experiments demonstrate that the proposed method is competitive with or outperforms two existing methods, including methods using reproducing kernels and random forests for nonparametric quantile regression. Spatial meshing for general Bayesian multivariate models http://jmlr.org/papers/v25/22-0083.html http://jmlr.org/papers/volume25/22-0083/22-0083.pdf 2024 Michele Peruzzi, David B. Dunson Quantifying spatial and/or temporal associations in multivariate geolocated data of different types is achievable via spatial random effects in a Bayesian hierarchical model, but severe computational bottlenecks arise when spatial dependence is encoded as a latent Gaussian process (GP) in the increasingly common large scale data settings on which we focus. The scenario worsens in non-Gaussian models because the reduced analytical tractability leads to additional hurdles to computational efficiency. In this article, we introduce Bayesian models of spatially referenced data in which the likelihood or the latent process (or both) are not Gaussian. First, we exploit the advantages of spatial processes built via directed acyclic graphs, in which case the spatial nodes enter the Bayesian hierarchy and lead to posterior sampling via routine Markov chain Monte Carlo (MCMC) methods. Second, motivated by the possible inefficiencies of popular gradient-based sampling approaches in the multivariate contexts on which we focus, we introduce the simplified manifold preconditioner adaptation (SiMPA) algorithm which uses second order information about the target but avoids expensive matrix operations. We demostrate the performance and efficiency improvements of our methods relative to alternatives in extensive synthetic and real world remote sensing and community ecology applications with large scale data at up to hundreds of thousands of spatial locations and up to tens of outcomes. Software for the proposed methods is part of R package meshed, available on CRAN. A Semi-parametric Estimation of Personalized Dose-response Function Using Instrumental Variables http://jmlr.org/papers/v25/21-1181.html http://jmlr.org/papers/volume25/21-1181/21-1181.pdf 2024 Wei Luo, Yeying Zhu, Xuekui Zhang, Lin Lin In the application of instrumental variable analysis that conducts causal inference in the presence of unmeasured confounding, invalid instrumental variables and weak instrumental variables often exist which complicate the analysis. In this paper, we propose a model-free dimension reduction procedure to select the invalid instrumental variables and refine them into lower-dimensional linear combinations. The procedure also combines the weak instrumental variables into a few stronger instrumental variables that best condense their information. We then introduce the personalized dose-response function that incorporates the subject's personal characteristics into the conventional dose-response function, and use the reduced data from dimension reduction to propose a novel and easily implementable nonparametric estimator of this function. The proposed approach is suitable for both discrete and continuous treatment variables, and is robust to the dimensionality of data. Its effectiveness is illustrated by the simulation studies and the data analysis of ADNI-DoD study, where the causal relationship between depression and dementia is investigated. Learning Non-Gaussian Graphical Models via Hessian Scores and Triangular Transport http://jmlr.org/papers/v25/21-0022.html http://jmlr.org/papers/volume25/21-0022/21-0022.pdf 2024 Ricardo Baptista, Youssef Marzouk, Rebecca Morrison, Olivier Zahm Undirected probabilistic graphical models represent the conditional dependencies, or Markov properties, of a collection of random variables. Knowing the sparsity of such a graphical model is valuable for modeling multivariate distributions and for efficiently performing inference. While the problem of learning graph structure from data has been studied extensively for certain parametric families of distributions, most existing methods fail to consistently recover the graph structure for non-Gaussian data. Here we propose an algorithm for learning the Markov structure of continuous and non-Gaussian distributions. To characterize conditional independence, we introduce a score based on integrated Hessian information from the joint log-density, and we prove that this score upper bounds the conditional mutual information for a general class of distributions. To compute the score, our algorithm SING estimates the density using a deterministic coupling, induced by a triangular transport map, and iteratively exploits sparse structure in the map to reveal sparsity in the graph. For certain non-Gaussian datasets, we show that our algorithm recovers the graph structure even with a biased approximation to the density. Among other examples, we apply SING to learn the dependencies between the states of a chaotic dynamical system with local interactions. On the Learnability of Out-of-distribution Detection http://jmlr.org/papers/v25/23-1257.html http://jmlr.org/papers/volume25/23-1257/23-1257.pdf 2024 Zhen Fang, Yixuan Li, Feng Liu, Bo Han, Jie Lu Supervised learning aims to train a classifier under the assumption that training and test data are from the same distribution. To ease the above assumption, researchers have studied a more realistic setting: out-of-distribution (OOD) detection, where test data may come from classes that are unknown during training (i.e., OOD data). Due to the unavailability and diversity of OOD data, good generalization ability is crucial for effective OOD detection algorithms, and corresponding learning theory is still an open problem. To study the generalization of OOD detection, this paper investigates the probably approximately correct (PAC) learning theory of OOD detection that fits the commonly used evaluation metrics in the literature. First, we find a necessary condition for the learnability of OOD detection. Then, using this condition, we prove several impossibility theorems for the learnability of OOD detection under some scenarios. Although the impossibility theorems are frustrating, we find that some conditions of these impossibility theorems may not hold in some practical scenarios. Based on this observation, we next give several necessary and sufficient conditions to characterize the learnability of OOD detection in some practical scenarios. Lastly, we offer theoretical support for representative OOD detection works based on our OOD theory. Win: Weight-Decay-Integrated Nesterov Acceleration for Faster Network Training http://jmlr.org/papers/v25/23-1073.html http://jmlr.org/papers/volume25/23-1073/23-1073.pdf 2024 Pan Zhou, Xingyu Xie, Zhouchen Lin, Kim-Chuan Toh, Shuicheng Yan Training deep networks on large-scale datasets is computationally challenging. This work explores the problem of “how to accelerate adaptive gradient algorithms in a general manner", and proposes an effective Weight-decay-Integrated Nesterov acceleration (Win) to accelerate adaptive algorithms. Taking AdamW and Adam as examples, per iteration, we construct a dynamical loss that combines the vanilla training loss and a dynamic regularizer inspired by proximal point method, and respectively minimize the first- and second-order Taylor approximations of dynamical loss to update variable. This yields our Win acceleration that uses a conservative step and an aggressive step to update, and linearly combines these two updates for acceleration. Next, we extend Win into Win2 which uses multiple aggressive update steps for faster convergence. Then we apply Win and Win2 to the popular LAMB and SGD optimizers. Our transparent derivation could provide insights for other accelerated methods and their integration into adaptive algorithms. Besides, we theoretically justify the faster convergence of Win- and Win2-accelerated AdamW, Adam and LAMB to their non-accelerated counterparts. Experimental results demonstrates the faster convergence speed and superior performance of our Win- and Win2-accelerated AdamW, Adam, LAMB and SGD over their vanilla counterparts on vision classification and language modeling tasks. On the Eigenvalue Decay Rates of a Class of Neural-Network Related Kernel Functions Defined on General Domains http://jmlr.org/papers/v25/23-0866.html http://jmlr.org/papers/volume25/23-0866/23-0866.pdf 2024 Yicheng Li, Zixiong Yu, Guhan Chen, Qian Lin In this paper, we provide a strategy to determine the eigenvalue decay rate (EDR) of a large class of kernel functions defined on a general domain rather than $\mathbb{S}^{d}$. This class of kernel functions include but are not limited to the neural tangent kernel associated with neural networks with different depths and various activation functions. After proving that the dynamics of training the wide neural networks uniformly approximated that of the neural tangent kernel regression on general domains, we can further illustrate the minimax optimality of the wide neural network provided that the underground truth function $f\in [\mathcal H_{\mathrm{NTK}}]^{s}$, an interpolation space associated with the RKHS $\mathcal{H}_{\mathrm{NTK}}$ of NTK. We also showed that the overfitted neural network can not generalize well. We believe our approach for determining the EDR of kernels might be also of independent interests. Tight Convergence Rate Bounds for Optimization Under Power Law Spectral Conditions http://jmlr.org/papers/v25/23-0698.html http://jmlr.org/papers/volume25/23-0698/23-0698.pdf 2024 Maksim Velikanov, Dmitry Yarotsky Performance of optimization on quadratic problems sensitively depends on the low-lying part of the spectrum. For large (effectively infinite-dimensional) problems, this part of the spectrum can often be naturally represented or approximated by power law distributions, resulting in power law convergence rates for iterative solutions of these problems by gradient-based algorithms. In this paper, we propose a new spectral condition providing tighter upper bounds for problems with power law optimization trajectories. We use this condition to build a complete picture of upper and lower bounds for a wide range of optimization algorithms - Gradient Descent, Steepest Descent, Heavy Ball, and Conjugate Gradients - with an emphasis on the underlying schedules of learning rate and momentum. In particular, we demonstrate how an optimally accelerated method, its schedule, and convergence upper bound can be obtained in a unified manner for a given shape of the spectrum. Also, we provide first proofs of tight lower bounds for convergence rates of Steepest Descent and Conjugate Gradients under spectral power laws with general exponents. Our experiments show that the obtained convergence bounds and acceleration strategies are not only relevant for exactly quadratic optimization problems, but also fairly accurate when applied to the training of neural networks. ptwt - The PyTorch Wavelet Toolbox http://jmlr.org/papers/v25/23-0636.html http://jmlr.org/papers/volume25/23-0636/23-0636.pdf 2024 Moritz Wolter, Felix Blanke, Jochen Garcke, Charles Tapley Hoyt The fast wavelet transform is an essential workhorse in signal processing. Wavelets are local in the spatial- or temporal- and the frequency-domain. This property enables frequency domain analysis while preserving some spatiotemporal information. Until recently, wavelets rarely appeared in the machine learning literature. We provide the PyTorch Wavelet Toolbox to make wavelet methods more accessible to the deep learning community. Our PyTorch Wavelet Toolbox is well documented. A pip package is installable with `pip install ptwt`. Choosing the Number of Topics in LDA Models – A Monte Carlo Comparison of Selection Criteria http://jmlr.org/papers/v25/23-0188.html http://jmlr.org/papers/volume25/23-0188/23-0188.pdf 2024 Victor Bystrov, Viktoriia Naboka-Krell, Anna Staszewska-Bystrova, Peter Winker Selecting the number of topics in Latent Dirichlet Allocation (LDA) models is considered to be a difficult task, for which various approaches have been proposed. In this paper the performance of the recently developed singular Bayesian information criterion (sBIC) is evaluated and compared to the performance of alternative model selection criteria. The sBIC is a generalization of the standard BIC that can be applied to singular statistical models. The comparison is based on Monte Carlo simulations and carried out for several alternative settings, varying with respect to the number of topics, the number of documents and the size of documents in the corpora. Performance is measured using different criteria which take into account the correct number of topics, but also whether the relevant topics from the considered data generation processes (DGPs) are revealed. Practical recommendations for LDA model selection in applications are derived. Functional Directed Acyclic Graphs http://jmlr.org/papers/v25/22-1038.html http://jmlr.org/papers/volume25/22-1038/22-1038.pdf 2024 Kuang-Yao Lee, Lexin Li, Bing Li In this article, we introduce a new method to estimate a directed acyclic graph (DAG) from multivariate functional data. We build on the notion of faithfulness that relates a DAG with a set of conditional independences among the random functions. We develop two linear operators, the conditional covariance operator and the partial correlation operator, to characterize and evaluate the conditional independence. Based on these operators, we adapt and extend the PC-algorithm to estimate the functional directed graph, so that the computation time depends on the sparsity rather than the full size of the graph. We study the asymptotic properties of the two operators, derive their uniform convergence rates, and establish the uniform consistency of the estimated graph, all of which are obtained while allowing the graph size to diverge to infinity with the sample size. We demonstrate the efficacy of our method through both simulations and an application to a time-course proteomic dataset. Unlabeled Principal Component Analysis and Matrix Completion http://jmlr.org/papers/v25/22-0816.html http://jmlr.org/papers/volume25/22-0816/22-0816.pdf 2024 Yunzhen Yao, Liangzu Peng, Manolis C. Tsakiris We introduce robust principal component analysis from a data matrix in which the entries of its columns have been corrupted by permutations, termed Unlabeled Principal Component Analysis (UPCA). Using algebraic geometry, we establish that UPCA is a well-defined algebraic problem since we prove that the only matrices of minimal rank that agree with the given data are row-permutations of the ground-truth matrix, arising as the unique solutions of a polynomial system of equations. Further, we propose an efficient two-stage algorithmic pipeline for UPCA suitable for the practically relevant case where only a fraction of the data have been permuted. Stage-I employs outlier-robust PCA methods to estimate the ground-truth column-space. Equipped with the column-space, Stage-II applies recent methods for unlabeled sensing to restore the permuted data. Allowing for missing entries on top of permutations in UPCA leads to the problem of unlabeled matrix completion, for which we derive theory and algorithms of similar flavor. Experiments on synthetic data, face images, educational and medical records reveal the potential of our algorithms for applications such as data privatization and record linkage. Distributed Estimation on Semi-Supervised Generalized Linear Model http://jmlr.org/papers/v25/22-0670.html http://jmlr.org/papers/volume25/22-0670/22-0670.pdf 2024 Jiyuan Tu, Weidong Liu, Xiaojun Mao Semi-supervised learning is devoted to using unlabeled data to improve the performance of machine learning algorithms. In this paper, we study the semi-supervised generalized linear model (GLM) in the distributed setup. In the cases of single or multiple machines containing unlabeled data, we propose two distributed semi-supervised algorithms based on the distributed approximate Newton method. When the labeled local sample size is small, our algorithms still give a consistent estimation, while fully supervised methods fail to converge. Moreover, we theoretically prove that the convergence rate is greatly improved when sufficient unlabeled data exists. Therefore, the proposed method requires much fewer rounds of communications to achieve the optimal rate than its fully-supervised counterpart. In the case of the linear model, we prove the rate lower bound after one round of communication, which shows that rate improvement is essential. Finally, several simulation analyses and real data studies are provided to demonstrate the effectiveness of our method. Towards Explainable Evaluation Metrics for Machine Translation http://jmlr.org/papers/v25/22-0416.html http://jmlr.org/papers/volume25/22-0416/22-0416.pdf 2024 Christoph Leiter, Piyawat Lertvittayakumjorn, Marina Fomicheva, Wei Zhao, Yang Gao, Steffen Eger Unlike classical lexical overlap metrics such as BLEU, most current evaluation metrics for machine translation (for example, COMET or BERTScore) are based on black-box large language models. They often achieve strong correlations with human judgments, but recent research indicates that the lower-quality classical metrics remain dominant, one of the potential reasons being that their decision processes are more transparent. To foster more widespread acceptance of novel high-quality metrics, explainability thus becomes crucial. In this concept paper, we identify key properties as well as key goals of explainable machine translation metrics and provide a comprehensive synthesis of recent techniques, relating them to our established goals and properties. In this context, we also discuss the latest state-of-the-art approaches to explainable metrics based on generative models such as ChatGPT and GPT4. Finally, we contribute a vision of next-generation approaches, including natural language explanations. We hope that our work can help catalyze and guide future research on explainable evaluation metrics and, mediately, also contribute to better and more transparent machine translation systems. Differentially private methods for managing model uncertainty in linear regression http://jmlr.org/papers/v25/21-1536.html http://jmlr.org/papers/volume25/21-1536/21-1536.pdf 2024 Víctor Peña, Andrés F. Barrientos In this article, we propose differentially private methods for hypothesis testing, model averaging, and model selection for normal linear models. We propose Bayesian methods based on mixtures of $g$-priors and non-Bayesian methods based on likelihood-ratio statistics and information criteria. The procedures are asymptotically consistent and straightforward to implement with existing software. We focus on practical issues such as adjusting critical values so that hypothesis tests have adequate type I error rates and quantifying the uncertainty introduced by the privacy-ensuring mechanisms. Data Summarization via Bilevel Optimization http://jmlr.org/papers/v25/21-1132.html http://jmlr.org/papers/volume25/21-1132/21-1132.pdf 2024 Zalán Borsos, Mojmír Mutný, Marco Tagliasacchi, Andreas Krause The increasing availability of massive data sets poses various challenges for machine learning. Prominent among these is learning models under hardware or human resource constraints. In such resource-constrained settings, a simple yet powerful approach is operating on small subsets of the data. Coresets are weighted subsets of the data that provide approximation guarantees for the optimization objective. However, existing coreset constructions are highly model-specific and are limited to simple models such as linear regression, logistic regression, and k-means. In this work, we propose a generic coreset construction framework that formulates the coreset selection as a cardinality-constrained bilevel optimization problem. In contrast to existing approaches, our framework does not require model-specific adaptations and applies to any twice differentiable model, including neural networks. We show the effectiveness of our framework for a wide range of models in various settings, including training non-convex models online and batch active learning. Pareto Smoothed Importance Sampling http://jmlr.org/papers/v25/19-556.html http://jmlr.org/papers/volume25/19-556/19-556.pdf 2024 Aki Vehtari, Daniel Simpson, Andrew Gelman, Yuling Yao, Jonah Gabry Importance weighting is a general way to adjust Monte Carlo integration to account for draws from the wrong distribution, but the resulting estimate can be highly variable when the importance ratios have a heavy right tail. This routinely occurs when there are aspects of the target distribution that are not well captured by the approximating distribution, in which case more stable estimates can be obtained by modifying extreme importance ratios. We present a new method for stabilizing importance weights using a generalized Pareto distribution fit to the upper tail of the distribution of the simulated importance ratios. The method, which empirically performs better than existing methods for stabilizing importance sampling estimates, includes stabilized effective sample size estimates, Monte Carlo error estimates, and convergence diagnostics. The presented Pareto $\hat{k}$ finite sample convergence rate diagnostic is useful for any Monte Carlo estimator. Policy Gradient Methods in the Presence of Symmetries and State Abstractions http://jmlr.org/papers/v25/23-1415.html http://jmlr.org/papers/volume25/23-1415/23-1415.pdf 2024 Prakash Panangaden, Sahand Rezaei-Shoshtari, Rosie Zhao, David Meger, Doina Precup Reinforcement learning (RL) on high-dimensional and complex problems relies on abstraction for improved efficiency and generalization. In this paper, we study abstraction in the continuous-control setting, and extend the definition of Markov decision process (MDP) homomorphisms to the setting of continuous state and action spaces. We derive a policy gradient theorem on the abstract MDP for both stochastic and deterministic policies. Our policy gradient results allow for leveraging approximate symmetries of the environment for policy optimization. Based on these theorems, we propose a family of actor-critic algorithms that are able to learn the policy and the MDP homomorphism map simultaneously, using the lax bisimulation metric. Finally, we introduce a series of environments with continuous symmetries to further demonstrate the ability of our algorithm for action abstraction in the presence of such symmetries. We demonstrate the effectiveness of our method on our environments, as well as on challenging visual control tasks from the DeepMind Control Suite. Our method's ability to utilize MDP homomorphisms for representation learning leads to improved performance, and the visualizations of the latent space clearly demonstrate the structure of the learned abstraction. Scaling Instruction-Finetuned Language Models http://jmlr.org/papers/v25/23-0870.html http://jmlr.org/papers/volume25/23-0870/23-0870.pdf 2024 Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, Jason Wei Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation, RealToxicityPrompts). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks (at time of release), such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints,1 which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models. Tangential Wasserstein Projections http://jmlr.org/papers/v25/23-0708.html http://jmlr.org/papers/volume25/23-0708/23-0708.pdf 2024 Florian Gunsilius, Meng Hsuan Hsieh, Myung Jin Lee We develop a notion of projections between sets of probability measures using the geometric properties of the $2$-Wasserstein space. In contrast to existing methods, it is designed for multivariate probability measures that need not be regular, and is computationally efficient to implement via regression. The idea is to work on tangent cones of the Wasserstein space using generalized geodesics. Its structure and computational properties make the method applicable in a variety of settings where probability measures need not be regular, from causal inference to the analysis of object data. An application to estimating causal effects yields a generalization of the synthetic controls method for systems with general heterogeneity described via multivariate probability measures. Learnability of Linear Port-Hamiltonian Systems http://jmlr.org/papers/v25/23-0450.html http://jmlr.org/papers/volume25/23-0450/23-0450.pdf 2024 Juan-Pablo Ortega, Daiying Yin A complete structure-preserving learning scheme for single-input/single-output (SISO) linear port-Hamiltonian systems is proposed. The construction is based on the solution, when possible, of the unique identification problem for these systems, in ways that reveal fundamental relationships between classical notions in control theory and crucial properties in the machine learning context, like structure-preservation and expressive power. In the canonical case, it is shown that, {up to initializations,} the set of uniquely identified systems can be explicitly characterized as a smooth manifold endowed with global Euclidean coordinates, which allows concluding that the parameter complexity necessary for the replication of the dynamics is only $\mathcal{O}(n)$ and not $\mathcal{O}(n^2)$, as suggested by the standard parametrization of these systems. Furthermore, it is shown that linear port-Hamiltonian systems can be learned while remaining agnostic about the dimension of the underlying data-generating system. Numerical experiments show that this methodology can be used to efficiently estimate linear port-Hamiltonian systems out of input-output realizations, making the contributions in this paper the first example of a structure-preserving machine learning paradigm for linear port-Hamiltonian systems based on explicit representations of this model category. Off-Policy Action Anticipation in Multi-Agent Reinforcement Learning http://jmlr.org/papers/v25/23-0413.html http://jmlr.org/papers/volume25/23-0413/23-0413.pdf 2024 Ariyan Bighashdel, Daan de Geus, Pavol Jancura, Gijs Dubbelman Learning anticipation in Multi-Agent Reinforcement Learning (MARL) is a reasoning paradigm where agents anticipate the learning steps of other agents to improve cooperation among themselves. As MARL uses gradient-based optimization, learning anticipation requires using Higher-Order Gradients (HOG), with so-called HOG methods. Existing HOG methods are based on policy parameter anticipation, i.e., agents anticipate the changes in policy parameters of other agents. Currently, however, these existing HOG methods have only been developed for differentiable games or games with small state spaces. In this work, we demonstrate that in the case of non-differentiable games with large state spaces, existing HOG methods do not perform well and are inefficient due to their inherent limitations related to policy parameter anticipation and multiple sampling stages. To overcome these problems, we propose Off-Policy Action Anticipation (OffPA2), a novel framework that approaches learning anticipation through action anticipation, i.e., agents anticipate the changes in actions of other agents, via off-policy sampling. We theoretically analyze our proposed OffPA2 and employ it to develop multiple HOG methods that are applicable to non-differentiable games with large state spaces. We conduct a large set of experiments and illustrate that our proposed HOG methods outperform the existing ones regarding efficiency and performance. On Unbiased Estimation for Partially Observed Diffusions http://jmlr.org/papers/v25/23-0347.html http://jmlr.org/papers/volume25/23-0347/23-0347.pdf 2024 Jeremy Heng, Jeremie Houssineau, Ajay Jasra We consider a class of diffusion processes with finite-dimensional parameters and partially observed at discrete time instances. We propose a methodology to unbiasedly estimate the expectation of a given functional of the diffusion process conditional on parameters and data. When these unbiased estimators with appropriately chosen functionals are employed within an expectation-maximization algorithm or a stochastic gradient method, this enables statistical inference using the maximum likelihood or Bayesian framework. Compared to existing approaches, the use of our unbiased estimators allows one to remove any time-discretization bias and Markov chain Monte Carlo burn-in bias. Central to our methodology is a novel and natural combination of multilevel randomization schemes and unbiased Markov chain Monte Carlo methods, and the development of new couplings of multiple conditional particle filters. We establish under assumptions that our estimators are unbiased and have finite variance. We illustrate various aspects of our method on an Ornstein--Uhlenbeck model, a logistic diffusion model for population dynamics, and a neural network model for grid cells. Improving Lipschitz-Constrained Neural Networks by Learning Activation Functions http://jmlr.org/papers/v25/22-1347.html http://jmlr.org/papers/volume25/22-1347/22-1347.pdf 2024 Stanislas Ducotterd, Alexis Goujon, Pakshal Bohra, Dimitris Perdios, Sebastian Neumayer, Michael Unser Lipschitz-constrained neural networks have several advantages over unconstrained ones and can be applied to a variety of problems, making them a topic of attention in the deep learning community. Unfortunately, it has been shown both theoretically and empirically that they perform poorly when equipped with ReLU activation functions. By contrast, neural networks with learnable 1-Lipschitz linear splines are known to be more expressive. In this paper, we show that such networks correspond to global optima of a constrained functional optimization problem that consists of the training of a neural network composed of 1-Lipschitz linear layers and 1-Lipschitz freeform activation functions with second-order total-variation regularization. Further, we propose an efficient method to train these neural networks. Our numerical experiments show that our trained networks compare favorably with existing 1-Lipschitz neural architectures. Mathematical Framework for Online Social Media Auditing http://jmlr.org/papers/v25/22-1112.html http://jmlr.org/papers/volume25/22-1112/22-1112.pdf 2024 Wasim Huleihel, Yehonathan Refael Social media platforms (SMPs) leverage algorithmic filtering (AF) as a means of selecting the content that constitutes a user's feed with the aim of maximizing their rewards. Selectively choosing the contents to be shown on the user's feed may yield a certain extent of influence, either minor or major, on the user's decision-making, compared to what it would have been under a natural/fair content selection. As we have witnessed over the past decade, algorithmic filtering can cause detrimental side effects, ranging from biasing individual decisions to shaping those of society as a whole, for example, diverting users' attention from whether to get the COVID-19 vaccine or inducing the public to choose a presidential candidate. The government's constant attempts to regulate the adverse effects of AF are often complicated, due to bureaucracy, legal affairs, and financial considerations. On the other hand SMPs seek to monitor their own algorithmic activities to avoid being fined for exceeding the allowable threshold. In this paper, we mathematically formalize this framework and utilize it to construct a data-driven statistical auditing procedure to regulate AF from deflecting users' beliefs over time, along with sample complexity guarantees. This state-of-the-art algorithm can be used either by authorities acting as external regulators or by SMPs for self-auditing. An Embedding Framework for the Design and Analysis of Consistent Polyhedral Surrogates http://jmlr.org/papers/v25/22-0743.html http://jmlr.org/papers/volume25/22-0743/22-0743.pdf 2024 Jessie Finocchiaro, Rafael M. Frongillo, Bo Waggoner We formalize and study the natural approach of designing convex surrogate loss functions via embeddings, for discrete problems such as classification, ranking, or structured prediction. In this approach, one embeds each of the finitely many predictions (e.g. rankings) as a point in $\mathbb{R}^d$, assigns the original loss values to these points, and “convexifies” the loss in some way to obtain a surrogate. We establish a strong connection between this approach and polyhedral (piecewise-linear convex) surrogate losses: every discrete loss is embedded by some polyhedral loss, and every polyhedral loss embeds some discrete loss. Moreover, an embedding gives rise to a consistent link function as well as linear surrogate regret bounds. Our results are constructive, as we illustrate with several examples. In particular, our framework gives succinct proofs of consistency or inconsistency for existing polyhedral surrogates, and for inconsistent surrogates, it further reveals the discrete losses for which these surrogates are consistent. We go on to show additional structure of embeddings, such as the equivalence of embedding and matching Bayes risks, and the equivalence of various notions of non-redudancy. Using these results, we establish that indirect elicitation, a necessary condition for consistency, is also sufficient when working with polyhedral surrogates. Low-rank Variational Bayes correction to the Laplace method http://jmlr.org/papers/v25/21-1405.html http://jmlr.org/papers/volume25/21-1405/21-1405.pdf 2024 Janet van Niekerk, Haavard Rue Approximate inference methods like the Laplace method, Laplace approximations and variational methods, amongst others, are popular methods when exact inference is not feasible due to the complexity of the model or the abundance of data. In this paper we propose a hybrid approximate method called Low-Rank Variational Bayes correction (VBC), that uses the Laplace method and subsequently a Variational Bayes correction in a lower dimension, to the joint posterior mean. The cost is essentially that of the Laplace method which ensures scalability of the method, in both model complexity and data size. Models with fixed and unknown hyperparameters are considered, for simulated and real examples, for small and large data sets. Scaling the Convex Barrier with Sparse Dual Algorithms http://jmlr.org/papers/v25/21-0076.html http://jmlr.org/papers/volume25/21-0076/21-0076.pdf 2024 Alessandro De Palma, Harkirat Singh Behl, Rudy Bunel, Philip H.S. Torr, M. Pawan Kumar Tight and efficient neural network bounding is crucial to the scaling of neural network verification systems. Many efficient bounding algorithms have been presented recently, but they are often too loose to verify more challenging properties. This is due to the weakness of the employed relaxation, which is usually a linear program of size linear in the number of neurons. While a tighter linear relaxation for piecewise-linear activations exists, it comes at the cost of exponentially many constraints and currently lacks an efficient customized solver. We alleviate this deficiency by presenting two novel dual algorithms: one operates a subgradient method on a small active set of dual variables, the other exploits the sparsity of Frank-Wolfe type optimizers to incur only a linear memory cost. Both methods recover the strengths of the new relaxation: tightness and a linear separation oracle. At the same time, they share the benefits of previous dual approaches for weaker relaxations: massive parallelism, GPU implementation, low cost per iteration and valid bounds at any time. As a consequence, we can obtain better bounds than off-the-shelf solvers in only a fraction of their running time, attaining significant formal verification speed-ups. Causal-learn: Causal Discovery in Python http://jmlr.org/papers/v25/23-0970.html http://jmlr.org/papers/volume25/23-0970/23-0970.pdf 2024 Yujia Zheng, Biwei Huang, Wei Chen, Joseph Ramsey, Mingming Gong, Ruichu Cai, Shohei Shimizu, Peter Spirtes, Kun Zhang Causal discovery aims at revealing causal relations from observational data, which is a fundamental task in science and engineering. We describe causal-learn, an open-source Python library for causal discovery. This library focuses on bringing a comprehensive collection of causal discovery methods to both practitioners and researchers. It provides easy-to-use APIs for non-specialists, modular building blocks for developers, detailed documentation for learners, and comprehensive methods for all. Different from previous packages in R or Java, causal-learn is fully developed in Python, which could be more in tune with the recent preference shift in programming languages within related communities. The library is available at https://github.com/py-why/causal-learn. Decomposed Linear Dynamical Systems (dLDS) for learning the latent components of neural dynamics http://jmlr.org/papers/v25/23-0777.html http://jmlr.org/papers/volume25/23-0777/23-0777.pdf 2024 Noga Mudrik, Yenho Chen, Eva Yezerets, Christopher J. Rozell, Adam S. Charles Learning interpretable representations of neural dynamics at a population level is a crucial first step to understanding how observed neural activity relates to perception and behavior. Models of neural dynamics often focus on either low-dimensional projections of neural activity or on learning dynamical systems that explicitly relate to the neural state over time. We discuss how these two approaches are interrelated by considering dynamical systems as representative of flows on a low-dimensional manifold. Building on this concept, we propose a new decomposed dynamical system model that represents complex non-stationary and nonlinear dynamics of time series data as a sparse combination of simpler, more interpretable components. Our model is trained through a dictionary learning procedure, where we leverage recent results in tracking sparse vectors over time. The decomposed nature of the dynamics is more expressive than previous switched approaches for a given number of parameters and enables modeling of overlapping and non-stationary dynamics. In both continuous-time and discrete-time instructional examples, we demonstrate that our model effectively approximates the original system, learns efficient representations, and captures smooth transitions between dynamical modes. Furthermore, we highlight our model’s ability to efficiently capture and demix population dynamics generated from multiple independent subnetworks, a task that is computationally impractical for switched models. Finally, we apply our model to neural “full brain” recordings of C. elegans data, illustrating a diversity of dynamics that is obscured when classified into discrete states. Existence and Minimax Theorems for Adversarial Surrogate Risks in Binary Classification http://jmlr.org/papers/v25/23-0456.html http://jmlr.org/papers/volume25/23-0456/23-0456.pdf 2024 Natalie S. Frank, Jonathan Niles-Weed We prove existence, minimax, and complementary slackness theorems for adversarial surrogate risks in binary classification. These results extend recent work that established analogous minimax and existence theorems for the adversarial classification risk. We show that such statements continue to hold for a very general class of surrogate losses; moreover, we remove some of the technical restrictions present in prior work. Our results provide an explanation for the phenomenon of transfer attacks and inform new directions in algorithm development. Data Thinning for Convolution-Closed Distributions http://jmlr.org/papers/v25/23-0446.html http://jmlr.org/papers/volume25/23-0446/23-0446.pdf 2024 Anna Neufeld, Ameer Dharamshi, Lucy L. Gao, Daniela Witten We propose data thinning, an approach for splitting an observation into two or more independent parts that sum to the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This very general proposal is applicable to any convolution-closed distribution, a class that includes the Gaussian, Poisson, negative binomial, gamma, and binomial distributions, among others. Data thinning has a number of applications to model selection, evaluation, and inference. For instance, cross-validation via data thinning provides an attractive alternative to the usual approach of cross-validation via sample splitting, especially in settings in which the latter is not applicable. In simulations and in an application to single-cell RNA-sequencing data, we show that data thinning can be used to validate the results of unsupervised learning approaches, such as k-means clustering and principal components analysis, for which traditional sample splitting is unattractive or unavailable. A projected semismooth Newton method for a class of nonconvex composite programs with strong prox-regularity http://jmlr.org/papers/v25/23-0371.html http://jmlr.org/papers/volume25/23-0371/23-0371.pdf 2024 Jiang Hu, Kangkang Deng, Jiayuan Wu, Quanzheng Li This paper aims to develop a Newton-type method to solve a class of nonconvex composite programs. In particular, the nonsmooth part is possibly nonconvex. To tackle the nonconvexity, we develop a notion of strong prox-regularity which is related to the singleton property and Lipschitz continuity of the associated proximal operator, and we verify it in various classes of functions, including weakly convex functions, indicator functions of proximally smooth sets, and two specific sphere-related nonconvex nonsmooth functions. In this case, the problem class we are concerned with covers smooth optimization problems on manifold and certain composite optimization problems on manifold. For the latter, the proposed algorithm is the first second-order type method. Combining with the semismoothness of the proximal operator, we design a projected semismooth Newton method to find a root of the natural residual induced by the proximal gradient method. Due to the possible nonconvexity of the feasible domain, an extra projection is added to the usual semismooth Newton step and new criteria are proposed for the switching between the projected semismooth Newton step and the proximal step. The global convergence is then established under the strong prox-regularity. Based on the BD regularity condition, we establish local superlinear convergence. Numerical experiments demonstrate the effectiveness of our proposed method compared with state-of-the-art ones. Revisiting RIP Guarantees for Sketching Operators on Mixture Models http://jmlr.org/papers/v25/23-0044.html http://jmlr.org/papers/volume25/23-0044/23-0044.pdf 2024 Ayoub Belhadji, Rémi Gribonval In the context of sketching for compressive mixture modeling, we revisit existing proofs of the Restricted Isometry Property of sketching operators with respect to certain mixtures models. After examining the shortcomings of existing guarantees, we propose an alternative analysis that circumvents the need to assume importance sampling when drawing random Fourier features to build random sketching operators. Our analysis is based on new deterministic bounds on the restricted isometry constant that depend solely on the set of frequencies used to define the sketching operator; then we leverage these bounds to establish concentration inequalities for random sketching operators that lead to the desired RIP guarantees. Our analysis also opens the door to theoretical guarantees for structured sketching with frequencies associated to fast random linear operators. Monotonic Risk Relationships under Distribution Shifts for Regularized Risk Minimization http://jmlr.org/papers/v25/22-1197.html http://jmlr.org/papers/volume25/22-1197/22-1197.pdf 2024 Daniel LeJeune, Jiayu Liu, Reinhard Heckel Machine learning systems are often applied to data that is drawn from a different distribution than the training distribution. Recent work has shown that for a variety of classification and signal reconstruction problems, the out-of-distribution performance is strongly linearly correlated with the in-distribution performance. If this relationship or more generally a monotonic one holds, it has important consequences. For example, it allows to optimize performance on one distribution as a proxy for performance on the other. In this paper, we study conditions under which a monotonic relationship between the performances of a model on two distributions is expected. We prove an exact asymptotic linear relation for squared error and a monotonic relation for misclassification error for ridge-regularized general linear models under covariate shift, as well as an approximate linear relation for linear inverse problems. Polygonal Unadjusted Langevin Algorithms: Creating stable and efficient adaptive algorithms for neural networks http://jmlr.org/papers/v25/22-0796.html http://jmlr.org/papers/volume25/22-0796/22-0796.pdf 2024 Dong-Young Lim, Sotirios Sabanis We present a new class of Langevin-based algorithms, which overcomes many of the known shortcomings of popular adaptive optimizers that are currently used for the fine tuning of deep learning models. Its underpinning theory relies on recent advances of Euler-Krylov polygonal approximations for stochastic differential equations (SDEs) with monotone coefficients. As a result, it inherits the stability properties of tamed algorithms, while it addresses other known issues, e.g. vanishing gradients in deep learning. In particular, we provide a nonasymptotic analysis and full theoretical guarantees for the convergence properties of an algorithm of this novel class, which we named TH$\varepsilon$O POULA (or, simply, TheoPouLa). Finally, several experiments are presented with different types of deep learning models, which show the superior performance of TheoPouLa over many popular adaptive optimization algorithms. Axiomatic effect propagation in structural causal models http://jmlr.org/papers/v25/22-0285.html http://jmlr.org/papers/volume25/22-0285/22-0285.pdf 2024 Raghav Singal, George Michailidis We study effect propagation in a causal directed acyclic graph (DAG), with the goal of providing a flow-based decomposition of the effect (i.e., change in the outcome variable) as a result of changes in the source variables. We first compare various ideas on causality to quantify effect propagation, such as direct and indirect effects, path-specific effects, and degree of responsibility. We discuss the shortcomings of such approaches and propose a flow-based methodology, which we call recursive Shapley value (RSV). By considering a broader set of counterfactuals than existing methods, RSV obeys a unique adherence to four desirable flow-based axioms. Further, we provide a general path-based characterization of RSV for an arbitrary non-parametric structural equations model (SEM) defined on the underlying DAG. Interestingly, for the special class of linear SEMs, RSV exhibits a simple and tractable characterization (and hence, computation), which recovers the classical method of path coefficients and is equivalent to path-specific effects. For non-parametric SEMs, we use our general characterization to develop an unbiased Monte-Carlo estimation procedure with an exponentially decaying sample complexity. We showcase the application of RSV on two challenging problems on causality (causal overdetermination and causal unfairness). Optimal First-Order Algorithms as a Function of Inequalities http://jmlr.org/papers/v25/21-1256.html http://jmlr.org/papers/volume25/21-1256/21-1256.pdf 2024 Chanwoo Park, Ernest K. Ryu In this work, we present a novel algorithm design methodology that finds the optimal algorithm as a function of inequalities. Specifically, we restrict convergence analyses of algorithms to use a prespecified subset of inequalities, rather than utilizing all true inequalities, and find the optimal algorithm subject to this restriction. This methodology allows us to design algorithms with certain desired characteristics. As concrete demonstrations of this methodology, we find new state-of-the-art accelerated first-order gradient methods using randomized coordinate updates and backtracking line searches. Resource-Efficient Neural Networks for Embedded Systems http://jmlr.org/papers/v25/18-566.html http://jmlr.org/papers/volume25/18-566/18-566.pdf 2024 Wolfgang Roth, Günther Schindler, Bernhard Klein, Robert Peharz, Sebastian Tschiatschek, Holger Fröning, Franz Pernkopf, Zoubin Ghahramani While machine learning is traditionally a resource intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensure a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday's applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on resource-efficient inference based on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques as well as potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark data sets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and prediction quality. Trained Transformers Learn Linear Models In-Context http://jmlr.org/papers/v25/23-1042.html http://jmlr.org/papers/volume25/23-1042/23-1042.pdf 2024 Ruiqi Zhang, Spencer Frei, Peter L. Bartlett Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL): Given a short prompt sequence of tokens from an unseen task, they can formulate relevant per-token and next-token predictions without any parameter updates. By embedding a sequence of labeled training data and unlabeled test data as a prompt, this allows for transformers to behave like supervised learning algorithms. Indeed, recent work has shown that when training transformer architectures over random instances of linear regression problems, these models' predictions mimic those of ordinary least squares. Towards understanding the mechanisms underlying this phenomenon, we investigate the dynamics of ICL in transformers with a single linear self-attention layer trained by gradient flow on linear regression tasks. We show that despite non-convexity, gradient flow with a suitable random initialization finds a global minimum of the objective function. At this global minimum, when given a test prompt of labeled examples from a new prediction task, the transformer achieves prediction error competitive with the best linear predictor over the test prompt distribution. We additionally characterize the robustness of the trained transformer to a variety of distribution shifts and show that although a number of shifts are tolerated, shifts in the covariate distribution of the prompts are not. Motivated by this, we consider a generalized ICL setting where the covariate distributions can vary across prompts. We show that although gradient flow succeeds at finding a global minimum in this setting, the trained transformer is still brittle under mild covariate shifts. We complement this finding with experiments on large, nonlinear transformer architectures which we show are more robust under covariate shifts. Adam-family Methods for Nonsmooth Optimization with Convergence Guarantees http://jmlr.org/papers/v25/23-0576.html http://jmlr.org/papers/volume25/23-0576/23-0576.pdf 2024 Nachuan Xiao, Xiaoyin Hu, Xin Liu, Kim-Chuan Toh In this paper, we present a comprehensive study on the convergence properties of Adam-family methods for nonsmooth optimization, especially in the training of nonsmooth neural networks. We introduce a novel two-timescale framework that adopts a two-timescale updating scheme, and prove its convergence properties under mild assumptions. Our proposed framework encompasses various popular Adam-family methods, providing convergence guarantees for these methods in training nonsmooth neural networks. Furthermore, we develop stochastic subgradient methods that incorporate gradient clipping techniques for training nonsmooth neural networks with heavy-tailed noise. Through our framework, we show that our proposed methods converge even when the evaluation noises are only assumed to be integrable. Extensive numerical experiments demonstrate the high efficiency and robustness of our proposed methods. Efficient Modality Selection in Multimodal Learning http://jmlr.org/papers/v25/23-0439.html http://jmlr.org/papers/volume25/23-0439/23-0439.pdf 2024 Yifei He, Runxiang Cheng, Gargi Balasubramaniam, Yao-Hung Hubert Tsai, Han Zhao Multimodal learning aims to learn from data of different modalities by fusing information from heterogeneous sources. Although it is beneficial to learn from more modalities, it is often infeasible to use all available modalities under limited computational resources. Modeling with all available modalities can also be inefficient and unnecessary when information across input modalities overlaps. In this paper, we study the modality selection problem, which aims to select the most useful subset of modalities for learning under a cardinality constraint. To that end, we propose a unified theoretical framework to quantify the learning utility of modalities, and we identify dependence assumptions to flexibly model the heterogeneous nature of multimodal data, which also allows efficient algorithm design. Accordingly, we derive a greedy modality selection algorithm via submodular maximization, which selects the most useful modalities with an optimality guarantee on learning performance. We also connect marginal-contribution-based feature importance scores, such as Shapley value, from the feature selection domain to the context of modality selection, to efficiently compute the importance of individual modality. We demonstrate the efficacy of our theoretical results and modality selection algorithms on 2 synthetic and 4 real-world data sets on a diverse range of multimodal data. A Multilabel Classification Framework for Approximate Nearest Neighbor Search http://jmlr.org/papers/v25/23-0286.html http://jmlr.org/papers/volume25/23-0286/23-0286.pdf 2024 Ville Hyvönen, Elias Jääsaari, Teemu Roos To learn partition-based index structures for approximate nearest neighbor (ANN) search, both supervised and unsupervised machine learning algorithms have been used. Existing supervised algorithms select all the points that belong to the same partition element as the query point as nearest neighbor candidates. Consequently, they formulate the learning task as finding a partition in which the nearest neighbors of a query point belong to the same partition element with it as often as possible. In contrast, we formulate the candidate set selection in ANN search directly as a multilabel classification problem where the labels correspond to the nearest neighbors of the query point. In the proposed framework, partition-based index structures are interpreted as partitioning classifiers for solving this classification problem. Empirical results suggest that, when combined with any partitioning strategy, the natural classifier based on the proposed framework leads to a strictly improved performance compared to the earlier candidate set selection methods. We also prove a sufficient condition for the consistency of a partitioning classifier for ANN search, and illustrate the result by verifying this condition for chronological $k$-d trees and (both dense and sparse) random projection trees. Probabilistic Forecasting with Generative Networks via Scoring Rule Minimization http://jmlr.org/papers/v25/23-0038.html http://jmlr.org/papers/volume25/23-0038/23-0038.pdf 2024 Lorenzo Pacchiardi, Rilwan A. Adewoyin, Peter Dueben, Ritabrata Dutta Probabilistic forecasting relies on past observations to provide a probability distribution for a future outcome, which is often evaluated against the realization using a scoring rule. Here, we perform probabilistic forecasting with generative neural networks, which parametrize distributions on high-dimensional spaces by transforming draws from a latent variable. Generative networks are typically trained in an adversarial framework. In contrast, we propose to train generative networks to minimize a predictive-sequential (or prequential) scoring rule on a recorded temporal sequence of the phenomenon of interest, which is appealing as it corresponds to the way forecasting systems are routinely evaluated. Adversarial-free minimization is possible for some scoring rules; hence, our framework avoids the cumbersome hyperparameter tuning and uncertainty underestimation due to unstable adversarial training, thus unlocking reliable use of generative networks in probabilistic forecasting. Further, we prove consistency of the minimizer of our objective with dependent data, while adversarial training assumes independence. We perform simulation studies on two chaotic dynamical models and a benchmark data set of global weather observations; for this last example, we define scoring rules for spatial data by drawing from the relevant literature. Our method outperforms state-of-the-art adversarial approaches, especially in probabilistic calibration, while requiring less hyperparameter tuning. Multiple Descent in the Multiple Random Feature Model http://jmlr.org/papers/v25/22-1389.html http://jmlr.org/papers/volume25/22-1389/22-1389.pdf 2024 Xuran Meng, Jianfeng Yao, Yuan Cao Recent works have demonstrated a double descent phenomenon in over-parameterized learning. Although this phenomenon has been investigated by recent works, it has not been fully understood in theory. In this paper, we investigate the multiple descent phenomenon in a class of multi-component prediction models. We first consider a "double random feature model" (DRFM) concatenating two types of random features, and study the excess risk achieved by the DRFM in ridge regression. We calculate the precise limit of the excess risk under the high dimensional framework where the training sample size, the dimension of data, and the dimension of random features tend to infinity proportionally. Based on the calculation, we further theoretically demonstrate that the risk curves of DRFMs can exhibit triple descent. We then provide a thorough experimental study to verify our theory. At last, we extend our study to the "multiple random feature model" (MRFM), and show that MRFMs ensembling $K$ types of random features may exhibit $(K+1)$-fold descent. Our analysis points out that risk curves with a specific number of descent generally exist in learning multi-component prediction models. Mean-Square Analysis of Discretized Itô Diffusions for Heavy-tailed Sampling http://jmlr.org/papers/v25/22-1198.html http://jmlr.org/papers/volume25/22-1198/22-1198.pdf 2024 Ye He, Tyler Farghly, Krishnakumar Balasubramanian, Murat A. Erdogdu We analyze the complexity of sampling from a class of heavy-tailed distributions by discretizing a natural class of Itô diffusions associated with weighted Poincaré inequalities. Based on a mean-square analysis, we establish the iteration complexity for obtaining a sample whose distribution is $\epsilon$ close to the target distribution in the Wasserstein-2 metric. In this paper, our results take the mean-square analysis to its limits, i.e., we invariably only require that the target density has finite variance, the minimal requirement for a mean-square analysis. To obtain explicit estimates, we compute upper bounds on certain moments associated with heavy-tailed targets under various assumptions. We also provide similar iteration complexity results for the case where only function evaluations of the unnormalized target density are available by estimating the gradients using a Gaussian smoothing technique. We provide illustrative examples based on the multivariate $t$-distribution. Invariant and Equivariant Reynolds Networks http://jmlr.org/papers/v25/22-0891.html http://jmlr.org/papers/volume25/22-0891/22-0891.pdf 2024 Akiyoshi Sannai, Makoto Kawano, Wataru Kumagai Various data exhibit symmetry, including permutations in graphs and point clouds. Machine learning methods that utilize this symmetry have achieved considerable success. In this study, we explore learning models for data exhibiting group symmetry. Our focus is on transforming deep neural networks using Reynolds operators, which average over the group to convert a function into an invariant or equivariant form. While learning methods based on Reynolds operators are well-established, they often face computational complexity challenges. To address this, we introduce two new methods that reduce the computational burden associated with the Reynolds operator: (i) Although the Reynolds operator traditionally averages over the entire group, we demonstrate that it can be effectively approximated by averaging over specific subsets of the group, termed the Reynolds design. (ii) We reveal that the pre-model does not require all input variables. Instead, using a select number of partial inputs (Reynolds dimension) is sufficient to achieve a universally applicable model. Employing these methods, which hinge on the Reynolds design and Reynolds dimension concepts, allows us to construct universally applicable models with manageable computational complexity. Our experiments on benchmark data indicate that our approach is more efficient than existing methods. Personalized PCA: Decoupling Shared and Unique Features http://jmlr.org/papers/v25/22-0810.html http://jmlr.org/papers/volume25/22-0810/22-0810.pdf 2024 Naichen Shi, Raed Al Kontar In this paper, we tackle a significant challenge in PCA: heterogeneity. When data are collected from different sources with heterogeneous trends while still sharing some congruency, it is critical to extract shared knowledge while retaining the unique features of each source. To this end, we propose personalized PCA (PerPCA), which uses mutually orthogonal global and local principal components to encode both unique and shared features. We show that, under mild conditions, both unique and shared features can be identified and recovered by a constrained optimization problem, even if the covariance matrices are immensely different. Also, we design a fully federated algorithm inspired by distributed Stiefel gradient descent to solve the problem. The algorithm introduces a new group of operations called generalized retractions to handle orthogonality constraints, and only requires global PCs to be shared across sources. We prove the linear convergence of the algorithm under suitable assumptions. Comprehensive numerical experiments highlight PerPCA's superior performance in feature extraction and prediction from heterogeneous datasets. As a systematic approach to decouple shared and unique features from heterogeneous datasets, PerPCA finds applications in several tasks, including video segmentation, topic extraction, and feature clustering. Survival Kernets: Scalable and Interpretable Deep Kernel Survival Analysis with an Accuracy Guarantee http://jmlr.org/papers/v25/22-0667.html http://jmlr.org/papers/volume25/22-0667/22-0667.pdf 2024 George H. Chen Kernel survival analysis models estimate individual survival distributions with the help of a kernel function, which measures the similarity between any two data points. Such a kernel function can be learned using deep kernel survival models. In this paper, we present a new deep kernel survival model called a survival kernet, which scales to large datasets in a manner that is amenable to model interpretation and also theoretical analysis. Specifically, the training data are partitioned into clusters based on a recently developed training set compression scheme for classification and regression called kernel netting that we extend to the survival analysis setting. At test time, each data point is represented as a weighted combination of these clusters, and each such cluster can be visualized. For a special case of survival kernets, we establish a finite-sample error bound on predicted survival distributions that is, up to a log factor, optimal. Whereas scalability at test time is achieved using the aforementioned kernel netting compression strategy, scalability during training is achieved by a warm-start procedure based on tree ensembles such as XGBoost and a heuristic approach to accelerating neural architecture search. On four standard survival analysis datasets of varying sizes (up to roughly 3 million data points), we show that survival kernets are highly competitive compared to various baselines tested in terms of time-dependent concordance index. Our code is available at: https://github.com/georgehc/survival-kernets On the Sample Complexity and Metastability of Heavy-tailed Policy Search in Continuous Control http://jmlr.org/papers/v25/21-1343.html http://jmlr.org/papers/volume25/21-1343/21-1343.pdf 2024 Amrit Singh Bedi, Anjaly Parayil, Junyu Zhang, Mengdi Wang, Alec Koppel Reinforcement learning is a framework for interactive decision-making with incentives sequentially revealed across time without a system dynamics model. Due to its scaling to continuous spaces, we focus on policy search where one iteratively improves a parameterized policy with stochastic policy gradient (PG) updates. In tabular Markov Decision Problems (MDPs), under persistent exploration and suitable parameterization, global optimality may be obtained. By contrast, in continuous space, the non-convexity poses a pathological challenge as evidenced by existing convergence results being mostly limited to stationarity or arbitrary local extrema. To close this gap, we step towards persistent exploration in continuous space through policy parameterizations defined by distributions of heavier tails defined by tail-index parameter $\alpha$, which increases the likelihood of jumping in state space. Doing so invalidates smoothness conditions of the score function common to PG. Thus, we establish how the convergence rate to stationarity depends on the policy's tail index $\alpha$, a Hölder continuity parameter, integrability conditions, and an exploration tolerance parameter introduced here for the first time. Further, we characterize the dependence of the set of local maxima on the tail index through an exit and transition time analysis of a suitably defined Markov chain, identifying that policies associated with Lévy Processes of a heavier tail converge to wider peaks. This phenomenon yields improved stability to perturbations in supervised learning, which we corroborate also manifests in improved performance of policy search, especially when myopic and farsighted incentives are misaligned. Convergence for nonconvex ADMM, with applications to CT imaging http://jmlr.org/papers/v25/21-0831.html http://jmlr.org/papers/volume25/21-0831/21-0831.pdf 2024 Rina Foygel Barber, Emil Y. Sidky The alternating direction method of multipliers (ADMM) algorithm is a powerful and flexible tool for complex optimization problems of the form $\min\{f(x)+g(y) : Ax+By=c\}$. ADMM exhibits robust empirical performance across a range of challenging settings including nonsmoothness and nonconvexity of the objective functions $f$ and $g$, and provides a simple and natural approach to the inverse problem of image reconstruction for computed tomography (CT) imaging. From the theoretical point of view, existing results for convergence in the nonconvex setting generally assume smoothness in at least one of the component functions in the objective. In this work, our new theoretical results provide convergence guarantees under a restricted strong convexity assumption without requiring smoothness or differentiability, while still allowing differentiable terms to be treated approximately if needed. We validate these theoretical results empirically, with a simulated example where both $f$ and $g$ are nondifferentiable---and thus outside the scope of existing theory---as well as a simulated CT image reconstruction problem. Distributed Gaussian Mean Estimation under Communication Constraints: Optimal Rates and Communication-Efficient Algorithms http://jmlr.org/papers/v25/21-0316.html http://jmlr.org/papers/volume25/21-0316/21-0316.pdf 2024 T. Tony Cai, Hongji Wei Distributed estimation of a Gaussian mean under communication constraints is studied in a decision theoretical framework. Minimax rates of convergence, which characterize the tradeoff between communication costs and statistical accuracy, are established under the independent protocols. Communication-efficient and statistically optimal procedures are developed. In the univariate case, the optimal rate depends only on the total communication budget, so long as each local machine has at least one bit. However, in the multivariate case, the minimax rate depends on the specific allocations of the communication budgets among the local machines. Although optimal estimation of a Gaussian mean is relatively simple in the conventional setting, it is quite involved under communication constraints, both in terms of the optimal procedure design and the lower bound argument. An essential step is the decomposition of the minimax estimation problem into two stages, localization and refinement. This critical decomposition provides a framework for both the lower bound analysis and optimal procedure design. The optimality results and techniques developed in the present paper can be useful for solving other problems such as distributed nonparametric function estimation and sparse signal recovery. Sparse NMF with Archetypal Regularization: Computational and Robustness Properties http://jmlr.org/papers/v25/21-0233.html http://jmlr.org/papers/volume25/21-0233/21-0233.pdf 2024 Kayhan Behdin, Rahul Mazumder We consider the problem of sparse nonnegative matrix factorization (NMF) using archetypal regularization. The goal is to represent a collection of data points as nonnegative linear combinations of a few nonnegative sparse factors with appealing geometric properties, arising from the use of archetypal regularization. We generalize the notion of robustness studied in Javadi and Montanari (2019) (without sparsity) to the notions of (a) strong robustness that implies each estimated archetype is close to the underlying archetypes and (b) weak robustness that implies there exists at least one recovered archetype that is close to the underlying archetypes. Our theoretical results on robustness guarantees hold under minimal assumptions on the underlying data, and applies to settings where the underlying archetypes need not be sparse. We present theoretical results and illustrative examples to strengthen the insights underlying the notions of robustness. We propose new algorithms for our optimization problem; and present numerical experiments on synthetic and real data sets that shed further insights into our proposed framework and theoretical developments. Deep Network Approximation: Beyond ReLU to Diverse Activation Functions http://jmlr.org/papers/v25/23-0912.html http://jmlr.org/papers/volume25/23-0912/23-0912.pdf 2024 Shijun Zhang, Jianfeng Lu, Hongkai Zhao This paper explores the expressive power of deep neural networks for a diverse range of activation functions. An activation function set $\mathscr{A}$ is defined to encompass the majority of commonly used activation functions, such as $\mathtt{ReLU}$, $\mathtt{LeakyReLU}$, $\mathtt{ReLU}^2$, $\mathtt{ELU}$, $\mathtt{CELU}$, $\mathtt{SELU}$, $\mathtt{Softplus}$, $\mathtt{GELU}$, $\mathtt{SiLU}$, $\mathtt{Swish}$, $\mathtt{Mish}$, $\mathtt{Sigmoid}$, $\mathtt{Tanh}$, $\mathtt{Arctan}$, $\mathtt{Softsign}$, $\mathtt{dSiLU}$, and $\mathtt{SRS}$. We demonstrate that for any activation function $\varrho\in \mathscr{A}$, a $\mathtt{ReLU}$ network of width $N$ and depth $L$ can be approximated to arbitrary precision by a $\varrho$-activated network of width $3N$ and depth $2L$ on any bounded set. This finding enables the extension of most approximation results achieved with $\mathtt{ReLU}$ networks to a wide variety of other activation functions, albeit with slightly increased constants. Significantly, we establish that the (width,$\,$depth) scaling factors can be further reduced from $(3,2)$ to $(1,1)$ if $\varrho$ falls within a specific subset of $\mathscr{A}$. This subset includes activation functions such as $\mathtt{ELU}$, $\mathtt{CELU}$, $\mathtt{SELU}$, $\mathtt{Softplus}$, $\mathtt{GELU}$, $\mathtt{SiLU}$, $\mathtt{Swish}$, and $\mathtt{Mish}$. Effect-Invariant Mechanisms for Policy Generalization http://jmlr.org/papers/v25/23-0802.html http://jmlr.org/papers/volume25/23-0802/23-0802.pdf 2024 Sorawit Saengkyongam, Niklas Pfister, Predrag Klasnja, Susan Murphy, Jonas Peters Policy learning is an important component of many real-world learning systems. A major challenge in policy learning is how to adapt efficiently to unseen environments or tasks. Recently, it has been suggested to exploit invariant conditional distributions to learn models that generalize better to unseen environments. However, assuming invariance of entire conditional distributions (which we call full invariance) may be too strong of an assumption in practice. In this paper, we introduce a relaxation of full invariance called effect-invariance (e-invariance for short) and prove that it is sufficient, under suitable assumptions, for zero-shot policy generalization. We also discuss an extension that exploits e-invariance when we have a small sample from the test environment, enabling few-shot policy generalization. Our work does not assume an underlying causal graph or that the data are generated by a structural causal model; instead, we develop testing procedures to test e-invariance directly from data. We present empirical results using simulated data and a mobile health intervention dataset to demonstrate the effectiveness of our approach. Pygmtools: A Python Graph Matching Toolkit http://jmlr.org/papers/v25/23-0572.html http://jmlr.org/papers/volume25/23-0572/23-0572.pdf 2024 Runzhong Wang, Ziao Guo, Wenzheng Pan, Jiale Ma, Yikai Zhang, Nan Yang, Qi Liu, Longxuan Wei, Hanxue Zhang, Chang Liu, Zetian Jiang, Xiaokang Yang, Junchi Yan Graph matching aims to find node-to-node matching among multiple graphs, which is a fundamental yet challenging problem. To facilitate graph matching in scientific research and industrial applications, pygmtools is released, which is a Python graph matching toolkit that implements a comprehensive collection of two-graph matching and multi-graph matching solvers, covering both learning-free solvers as well as learning-based neural graph matching solvers. Our implementation supports numerical backends including Numpy, PyTorch, Jittor, Paddle, runs on Windows, MacOS and Linux, and is friendly to install and configure. Comprehensive documentations covering beginner's guide, API reference and examples are available online. pygmtools is open-sourced under Mulan PSL v2 license. Heterogeneous-Agent Reinforcement Learning http://jmlr.org/papers/v25/23-0488.html http://jmlr.org/papers/volume25/23-0488/23-0488.pdf 2024 Yifan Zhong, Jakub Grudzien Kuba, Xidong Feng, Siyi Hu, Jiaming Ji, Yaodong Yang The necessity for cooperation among intelligent machines has popularised cooperative multi-agent reinforcement learning (MARL) in AI research. However, many research endeavours heavily rely on parameter sharing among agents, which confines them to only homogeneous-agent setting and leads to training instability and lack of convergence guarantees. To achieve effective cooperation in the general heterogeneous-agent setting, we propose Heterogeneous-Agent Reinforcement Learning (HARL) algorithms that resolve the aforementioned issues. Central to our findings are the multi-agent advantage decomposition lemma and the sequential update scheme. Based on these, we develop the provably correct Heterogeneous-Agent Trust Region Learning (HATRL), and derive HATRPO and HAPPO by tractable approximations. Furthermore, we discover a novel framework named Heterogeneous-Agent Mirror Learning (HAML), which strengthens theoretical guarantees for HATRPO and HAPPO and provides a general template for cooperative MARL algorithmic designs. We prove that all algorithms derived from HAML inherently enjoy monotonic improvement of joint return and convergence to Nash Equilibrium. As its natural outcome, HAML validates more novel algorithms in addition to HATRPO and HAPPO, including HAA2C, HADDPG, and HATD3, which generally outperform their existing MA-counterparts. We comprehensively test HARL algorithms on six challenging benchmarks and demonstrate their superior effectiveness and stability for coordinating heterogeneous agents compared to strong baselines such as MAPPO and QMIX. Sample-efficient Adversarial Imitation Learning http://jmlr.org/papers/v25/23-0314.html http://jmlr.org/papers/volume25/23-0314/23-0314.pdf 2024 Dahuin Jung, Hyungyu Lee, Sungroh Yoon Imitation learning, in which learning is performed by demonstration, has been studied and advanced for sequential decision-making tasks in which a reward function is not predefined. However, imitation learning methods still require numerous expert demonstration samples to successfully imitate an expert's behavior. To improve sample efficiency, we utilize self-supervised representation learning, which can generate vast training signals from the given data. In this study, we propose a self-supervised representation-based adversarial imitation learning method to learn state and action representations that are robust to diverse distortions and temporally predictive, on non-image control tasks. In particular, in comparison with existing self-supervised learning methods for tabular data, we propose a different corruption method for state and action representations that is robust to diverse distortions. We theoretically and empirically observe that making an informative feature manifold with less sample complexity significantly improves the performance of imitation learning. The proposed method shows a 39% relative improvement over existing adversarial imitation learning methods on MuJoCo in a setting limited to 100 expert state-action pairs. Moreover, we conduct comprehensive ablations and additional experiments using demonstrations with varying optimality to provide insights into a range of factors. Stochastic Modified Flows, Mean-Field Limits and Dynamics of Stochastic Gradient Descent http://jmlr.org/papers/v25/23-0220.html http://jmlr.org/papers/volume25/23-0220/23-0220.pdf 2024 Benjamin Gess, Sebastian Kassing, Vitalii Konarovskyi We propose new limiting dynamics for stochastic gradient descent in the small learning rate regime called stochastic modified flows. These SDEs are driven by a cylindrical Brownian motion and improve the so-called stochastic modified equations by having regular diffusion coefficients and by matching the multi-point statistics. As a second contribution, we introduce distribution dependent stochastic modified flows which we prove to describe the fluctuating limiting dynamics of stochastic gradient descent in the small learning rate - infinite width scaling regime. Rates of convergence for density estimation with generative adversarial networks http://jmlr.org/papers/v25/23-0062.html http://jmlr.org/papers/volume25/23-0062/23-0062.pdf 2024 Nikita Puchkin, Sergey Samsonov, Denis Belomestny, Eric Moulines, Alexey Naumov In this work we undertake a thorough study of the non-asymptotic properties of the vanilla generative adversarial networks (GANs). We prove an oracle inequality for the Jensen-Shannon (JS) divergence between the underlying density $\mathsf{p}^*$ and the GAN estimate with a significantly better statistical error term compared to the previously known results. The advantage of our bound becomes clear in application to nonparametric density estimation. We show that the JS-divergence between the GAN estimate and $\mathsf{p}^*$ decays as fast as $(\log{n}/n)^{2\beta/(2\beta + d)}$, where $n$ is the sample size and $\beta$ determines the smoothness of $\mathsf{p}^*$. This rate of convergence coincides (up to logarithmic factors) with minimax optimal for the considered class of densities. Additive smoothing error in backward variational inference for general state-space models http://jmlr.org/papers/v25/22-1392.html http://jmlr.org/papers/volume25/22-1392/22-1392.pdf 2024 Mathis Chagneux, Elisabeth Gassiat, Pierre Gloaguen, Sylvain Le Corff We consider the problem of state estimation in general state-space models using variational inference. For a generic variational family defined using the same backward decomposition as the actual joint smoothing distribution, we establish under mixing assumptions that the variational approximation of expectations of additive state functionals induces an error which grows at most linearly in the number of observations. This guarantee is consistent with the known upper bounds for the approximation of smoothing distributions using standard Monte Carlo methods. We illustrate our theoretical result with state-of-the art variational solutions based both on the backward parameterization and on alternatives using forward decompositions. This numerical study proposes guidelines for variational inference based on neural networks in state-space models. Optimal Bump Functions for Shallow ReLU networks: Weight Decay, Depth Separation, Curse of Dimensionality http://jmlr.org/papers/v25/22-1296.html http://jmlr.org/papers/volume25/22-1296/22-1296.pdf 2024 Stephan Wojtowytsch In this note, we study how neural networks with a single hidden layer and ReLU activation interpolate data drawn from a radially symmetric distribution with target labels 1 at the origin and 0 outside the unit ball, if no labels are known inside the unit ball. With weight decay regularization and in the infinite neuron, infinite data limit, we prove that a unique radially symmetric minimizer exists, whose average parameters and Lipschitz constant grow as $d$ and $\sqrt{d}$ respectively. We furthermore show that the average weight variable grows exponentially in $d$ if the label $1$ is imposed on a ball of radius $\varepsilon$ rather than just at the origin. By comparison, a neural networks with two hidden layers can approximate the target function without encountering the curse of dimensionality. Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees http://jmlr.org/papers/v25/22-1170.html http://jmlr.org/papers/volume25/22-1170/22-1170.pdf 2024 Alexander Terenin, David R. Burt, Artem Artemev, Seth Flaxman, Mark van der Wilk, Carl Edward Rasmussen, Hong Ge Gaussian processes are frequently deployed as part of larger machine learning and decision-making systems, for instance in geospatial modeling, Bayesian optimization, or in latent Gaussian models. Within a system, the Gaussian process model needs to perform in a stable and reliable manner to ensure it interacts correctly with other parts of the system. In this work, we study the numerical stability of scalable sparse approximations based on inducing points. To do so, we first review numerical stability, and illustrate typical situations in which Gaussian process models can be unstable. Building on stability theory originally developed in the interpolation literature, we derive sufficient and in certain cases necessary conditions on the inducing points for the computations performed to be numerically stable. For low-dimensional tasks such as geospatial modeling, we propose an automated method for computing inducing points satisfying these conditions. This is done via a modification of the cover tree data structure, which is of independent interest. We additionally propose an alternative sparse approximation for regression with a Gaussian likelihood which trades off a small amount of performance to further improve stability. We provide illustrative examples showing the relationship between stability of calculations and predictive performance of inducing point methods on spatial tasks. On Tail Decay Rate Estimation of Loss Function Distributions http://jmlr.org/papers/v25/22-0846.html http://jmlr.org/papers/volume25/22-0846/22-0846.pdf 2024 Etrit Haxholli, Marco Lorenzi The study of loss-function distributions is critical to characterize a model's behaviour on a given machine-learning problem. While model quality is commonly measured by the average loss assessed on a testing set, this quantity does not ascertain the existence of the mean of the loss distribution. Conversely, the existence of a distribution's statistical moments can be verified by examining the thickness of its tails. Cross-validation schemes determine a family of testing loss distributions conditioned on the training sets. By marginalizing across training sets, we can recover the overall (marginal) loss distribution, whose tail-shape we aim to estimate. Small sample-sizes diminish the reliability and efficiency of classical tail-estimation methods like Peaks-Over-Threshold, and we demonstrate that this effect is notably significant when estimating tails of marginal distributions composed of conditional distributions with substantial tail-location variability. We mitigate this problem by utilizing a result we prove: under certain conditions, the marginal-distribution's tail-shape parameter is the maximum tail-shape parameter across the conditional distributions underlying the marginal. We label the resulting approach as `cross-tail estimation (CTE)'. We test CTE in a series of experiments on simulated and real data showing the improved robustness and quality of tail estimation as compared to classical approaches. Deep Nonparametric Estimation of Operators between Infinite Dimensional Spaces http://jmlr.org/papers/v25/22-0719.html http://jmlr.org/papers/volume25/22-0719/22-0719.pdf 2024 Hao Liu, Haizhao Yang, Minshuo Chen, Tuo Zhao, Wenjing Liao Learning operators between infinitely dimensional spaces is an important learning task arising in machine learning, imaging science, mathematical modeling and simulations, etc. This paper studies the nonparametric estimation of Lipschitz operators using deep neural networks. Non-asymptotic upper bounds are derived for the generalization error of the empirical risk minimizer over a properly chosen network class. Under the assumption that the target operator exhibits a low dimensional structure, our error bounds decay as the training sample size increases, with an attractive fast rate depending on the intrinsic dimension in our estimation. Our assumptions cover most scenarios in real applications and our results give rise to fast rates by exploiting low dimensional structures of data in operator estimation. We also investigate the influence of network structures (e.g., network width, depth, and sparsity) on the generalization error of the neural network estimator and propose a general suggestion on the choice of network structures to maximize the learning efficiency quantitatively. Post-Regularization Confidence Bands for Ordinary Differential Equations http://jmlr.org/papers/v25/22-0487.html http://jmlr.org/papers/volume25/22-0487/22-0487.pdf 2024 Xiaowu Dai, Lexin Li Ordinary differential equation (ODE) is an important tool to study a system of biological and physical processes. A central question in ODE modeling is to infer the significance of individual regulatory effect of one signal variable on another. However, building confidence band for ODE with unknown regulatory relations is challenging, and it remains largely an open question. In this article, we construct the post-regularization confidence band for the individual regulatory function in ODE with unknown functionals and noisy data observations. Our proposal is the first of its kind, and is built on two novel ingredients. The first is a new localized kernel learning approach that combines reproducing kernel learning with local Taylor approximation, and the second is a new de-biasing method that tackles infinite-dimensional functionals and additional measurement errors. We show that the constructed confidence band has the desired asymptotic coverage probability, and the recovered regulatory network approaches the truth with probability tending to one. We establish the theoretical properties when the number of variables in the system can be either smaller or larger than the number of sampling time points, and we study the regime-switching phenomenon. We demonstrate the efficacy of the proposed method through both simulations and illustrations with two data applications. On the Generalization of Stochastic Gradient Descent with Momentum http://jmlr.org/papers/v25/22-0068.html http://jmlr.org/papers/volume25/22-0068/22-0068.pdf 2024 Ali Ramezani-Kebrya, Kimon Antonakopoulos, Volkan Cevher, Ashish Khisti, Ben Liang While momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models, there is little theoretical understanding on the generalization error of such methods. In this work, we first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded. Then, for smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes, and show that it can train machine learning models for multiple epochs with a guarantee for generalization. Finally, for the special case of strongly convex loss functions, we find a range of momentum such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalizes. Extending our results on generalization, we also develop an upper bound on the expected true risk, in terms of the number of training steps, sample size, and momentum. Our experimental evaluations verify the consistency between the numerical results and our theoretical bounds. SGDEM improves the generalization error of SGDM when training ResNet-18 on ImageNet in practical distributed settings. Pursuit of the Cluster Structure of Network Lasso: Recovery Condition and Non-convex Extension http://jmlr.org/papers/v25/21-1190.html http://jmlr.org/papers/volume25/21-1190/21-1190.pdf 2024 Shotaro Yagishita, Jun-ya Gotoh Network lasso (NL for short) is a technique for estimating models by simultaneously clustering data samples and fitting the models to them. It often succeeds in forming clusters thanks to the geometry of the sum of $\ell_2$ norm employed therein, but there may be limitations due to the convexity of the regularizer. This paper focuses on clustering generated by NL and strengthens it by creating a non-convex extension, called network trimmed lasso (NTL for short). Specifically, we initially investigate a sufficient condition that guarantees the recovery of the latent cluster structure of NL on the basis of the result of Sun et al. (2021) for convex clustering, which is a special case of NL for ordinary clustering. Second, we extend NL to NTL to incorporate a cardinality (or, $\ell_0$-)constraint and rewrite the constrained optimization problem defined with the $\ell_0$ norm, a discontinuous function, into an equivalent unconstrained continuous optimization problem. We develop ADMM algorithms to solve NTL and show their convergence results. Numerical illustrations indicate that the non-convex extension provides a more clear-cut cluster structure when NL fails to form clusters without incorporating prior knowledge of the associated parameters. Iterate Averaging in the Quest for Best Test Error http://jmlr.org/papers/v25/21-1125.html http://jmlr.org/papers/volume25/21-1125/21-1125.pdf 2024 Diego Granziol, Nicholas P. Baskerville, Xingchen Wan, Samuel Albanie, Stephen Roberts We analyse and explain the increased generalisation performance of iterate averaging using a Gaussian process perturbation model between the true and batch risk surface on the high dimensional quadratic. We derive three phenomena from our theoretical results: (1) The importance of combining iterate averaging (IA) with large learning rates and regularisation for improved generalisation. (2) Justification for less frequent averaging. (3) That we expect adaptive gradient methods to work equally well, or better, with iterate averaging than their non-adaptive counterparts. Inspired by these results, together with empirical investigations of the importance of appropriate regularisation for the solution diversity of the iterates, we propose two adaptive algorithms with iterate averaging. These give significantly better results compared to stochastic gradient descent (SGD), require less tuning and do not require early stopping or validation set monitoring. We showcase the efficacy of our approach on the CIFAR-10/100, ImageNet and Penn Treebank datasets on a variety of modern and classical network architectures. Nonparametric Inference under B-bits Quantization http://jmlr.org/papers/v25/20-075.html http://jmlr.org/papers/volume25/20-075/20-075.pdf 2024 Kexuan Li, Ruiqi Liu, Ganggang Xu, Zuofeng Shang Statistical inference based on lossy or incomplete samples is often needed in research areas such as signal/image processing, medical image storage, remote sensing, signal transmission. In this paper, we propose a nonparametric testing procedure based on samples quantized to $B$ bits through a computationally efficient algorithm. Under mild technical conditions, we establish the asymptotic properties of the proposed test statistic and investigate how the testing power changes as $B$ increases. In particular, we show that if $B$ exceeds a certain threshold, the proposed nonparametric testing procedure achieves the classical minimax rate of testing (Shang and Cheng, 2015) for spline models. We further extend our theoretical investigations to a nonparametric linearity test and an adaptive nonparametric test, expanding the applicability of the proposed methods. Extensive simulation studies {together with a real-data analysis} are used to demonstrate the validity and effectiveness of the proposed tests. Black Box Variational Inference with a Deterministic Objective: Faster, More Accurate, and Even More Black Box http://jmlr.org/papers/v25/23-1015.html http://jmlr.org/papers/volume25/23-1015/23-1015.pdf 2024 Ryan Giordano, Martin Ingram, Tamara Broderick Automatic differentiation variational inference (ADVI) offers fast and easy-to-use posterior approximation in multiple modern probabilistic programming languages. However, its stochastic optimizer lacks clear convergence criteria and requires tuning parameters. Moreover, ADVI inherits the poor posterior uncertainty estimates of mean-field variational Bayes (MFVB). We introduce "deterministic ADVI" (DADVI) to address these issues. DADVI replaces the intractable MFVB objective with a fixed Monte Carlo approximation, a technique known in the stochastic optimization literature as the "sample average approximation" (SAA). By optimizing an approximate but deterministic objective, DADVI can use off-the-shelf second-order optimization, and, unlike standard mean-field ADVI, is amenable to more accurate posterior covariances via linear response (LR). In contrast to existing worst-case theory, we show that, on certain classes of common statistical problems, DADVI and the SAA can perform well with relatively few samples even in very high dimensions, though we also show that such favorable results cannot extend to variational approximations that are too expressive relative to mean-field ADVI. We show on a variety of real-world problems that DADVI reliably finds good solutions with default settings (unlike ADVI) and, together with LR covariances, is typically faster and more accurate than standard ADVI. On Sufficient Graphical Models http://jmlr.org/papers/v25/23-0893.html http://jmlr.org/papers/volume25/23-0893/23-0893.pdf 2024 Bing Li, Kyongwon Kim We introduce a sufficient graphical model by applying the recently developed nonlinear sufficient dimension reduction techniques to the evaluation of conditional independence. The graphical model is nonparametric in nature, as it does not make distributional assumptions such as the Gaussian or copula Gaussian assumptions. However, unlike a fully nonparametric graphical model, which relies on the high-dimensional kernel to characterize conditional independence, our graphical model is based on conditional independence given a set of sufficient predictors with a substantially reduced dimension. In this way we avoid the curse of dimensionality that comes with a high-dimensional kernel. We develop the population-level properties, convergence rate, and variable selection consistency of our estimate. By simulation comparisons and an analysis of the DREAM 4 Challenge data set, we demonstrate that our method outperforms the existing methods when the Gaussian or copula Gaussian assumptions are violated, and its performance remains excellent in the high-dimensional setting. Localized Debiased Machine Learning: Efficient Inference on Quantile Treatment Effects and Beyond http://jmlr.org/papers/v25/23-0661.html http://jmlr.org/papers/volume25/23-0661/23-0661.pdf 2024 Nathan Kallus, Xiaojie Mao, Masatoshi Uehara We consider estimating a low-dimensional parameter in an estimating equation involving high-dimensional nuisance functions that depend on the target parameter as an input. A central example is the efficient estimating equation for the (local) quantile treatment effect ((L)QTE) in causal inference, which involves the covariate-conditional cumulative distribution function evaluated at the quantile to be estimated. Existing approaches based on flexibly estimating the nuisances and plugging in the estimates, such as debiased machine learning (DML), require we learn the nuisance at all possible inputs. For (L)QTE, DML requires we learn the whole covariate-conditional cumulative distribution function. We instead propose localized debiased machine learning (LDML), which avoids this burdensome step and needs only estimate nuisances at a single initial rough guess for the target parameter. For (L)QTE, LDML involves learning just two regression functions, a standard task for machine learning methods. We prove that under lax rate conditions our estimator has the same favorable asymptotic behavior as the infeasible estimator that uses the unknown true nuisances. Thus, LDML notably enables practically-feasible and theoretically-grounded efficient estimation of important quantities in causal inference such as (L)QTEs when we must control for many covariates and/or flexible relationships, as we demonstrate in empirical studies. On the Effect of Initialization: The Scaling Path of 2-Layer Neural Networks http://jmlr.org/papers/v25/23-0549.html http://jmlr.org/papers/volume25/23-0549/23-0549.pdf 2024 Sebastian Neumayer, Lénaïc Chizat, Michael Unser In supervised learning, the regularization path is sometimes used as a convenient theoretical proxy for the optimization path of gradient descent initialized from zero. In this paper, we study a modification of the regularization path for infinite-width 2-layer ReLU neural networks with nonzero initial distribution of the weights at different scales. By exploiting a link with unbalanced optimal-transport theory, we show that, despite the non-convexity of the 2-layer network training, this problem admits an infinite-dimensional convex counterpart. We formulate the corresponding functional-optimization problem and investigate its main properties. In particular, we show that, as the scale of the initialization ranges between $0$ and $+\infty$, the associated path interpolates continuously between the so-called kernel and rich regimes. Numerical experiments confirm that, in our setting, the scaling path and the final states of the optimization path behave similarly, even beyond these extreme points. Improving physics-informed neural networks with meta-learned optimization http://jmlr.org/papers/v25/23-0356.html http://jmlr.org/papers/volume25/23-0356/23-0356.pdf 2024 Alex Bihlo We show that the error achievable using physics-informed neural networks for solving differential equations can be substantially reduced when these networks are trained using meta-learned optimization methods rather than using fixed, hand-crafted optimizers as traditionally done. We choose a learnable optimization method based on a shallow multi-layer perceptron that is meta-trained for specific classes of differential equations. We illustrate meta-trained optimizers for several equations of practical relevance in mathematical physics, including the linear advection equation, Poisson's equation, the Korteweg-de Vries equation and Burgers' equation. We also illustrate that meta-learned optimizers exhibit transfer learning abilities, in that a meta-trained optimizer on one differential equation can also be successfully deployed on another differential equation. A Comparison of Continuous-Time Approximations to Stochastic Gradient Descent http://jmlr.org/papers/v25/23-0237.html http://jmlr.org/papers/volume25/23-0237/23-0237.pdf 2024 Stefan Ankirchner, Stefan Perko Applying a stochastic gradient descent (SGD) method for minimizing an objective gives rise to a discrete-time process of estimated parameter values. In order to better understand the dynamics of the estimated values, many authors have considered continuous-time approximations of SGD. We refine existing results on the weak error of first-order ODE and SDE approximations to SGD for non-infinitesimal learning rates. In particular, we explicitly compute the linear term in the error expansion of gradient flow and two of its stochastic counterparts, with respect to a discretization parameter $h$. In the example of linear regression, we demonstrate the general inferiority of the deterministic gradient flow approximation in comparison to the stochastic ones, for batch sizes which are not too large. Further, we demonstrate that for Gaussian features an SDE approximation with state-independent noise (CC) is preferred over using a state-dependent coefficient (NCC). The same comparison holds true for features of low kurtosis or large batch sizes. However, the relationship reverses for highly leptokurtic features or small batch sizes. Critically Assessing the State of the Art in Neural Network Verification http://jmlr.org/papers/v25/23-0119.html http://jmlr.org/papers/volume25/23-0119/23-0119.pdf 2024 Matthias König, Annelot W. Bosman, Holger H. Hoos, Jan N. van Rijn Recent research has proposed various methods to formally verify neural networks against minimal input perturbations; this verification task is also known as local robustness verification. The research area of local robustness verification is highly diverse, as verifiers rely on a multitude of techniques, including mixed integer programming and satisfiability modulo theories. At the same time, the problem instances encountered when performing local robustness verification differ based on the network to be verified, the property to be verified and the specific network input. This raises the question of which verification algorithm is most suitable for solving specific types of instances of the local robustness verification problem. To answer this question, we performed a systematic performance analysis of several CPU- and GPU-based local robustness verification systems on a newly and carefully assembled set of 79 neural networks, of which we verified a broad range of robustness properties, while taking a practitioner's point of view -- a perspective that complements the insights from initiatives such as the VNN competition, where the participating tools are carefully adapted to the given benchmarks by their developers. Notably, we show that no single best algorithm dominates performance across all verification problem instances. Instead, our results reveal complementarities in verifier performance and illustrate the potential of leveraging algorithm portfolios for more efficient local robustness verification. We quantify this complementarity using various performance measures, such as the Shapley value. Furthermore, we confirm the notion that most algorithms only support ReLU-based networks, while other activation functions remain under-supported. Estimating the Minimizer and the Minimum Value of a Regression Function under Passive Design http://jmlr.org/papers/v25/22-1396.html http://jmlr.org/papers/volume25/22-1396/22-1396.pdf 2024 Arya Akhavan, Davit Gogolashvili, Alexandre B. Tsybakov We propose a new method for estimating the minimizer $\boldsymbol{x}^*$ and the minimum value $f^*$ of a smooth and strongly convex regression function $f$ from the observations contaminated by random noise. Our estimator $\boldsymbol{z}_n$ of the minimizer $\boldsymbol{x}^*$ is based on a version of the projected gradient descent with the gradient estimated by a regularized local polynomial algorithm. Next, we propose a two-stage procedure for estimation of the minimum value $f^*$ of regression function $f$. At the first stage, we construct an accurate enough estimator of $\boldsymbol{x}^*$, which can be, for example, $\boldsymbol{z}_n$. At the second stage, we estimate the function value at the point obtained in the first stage using a rate optimal nonparametric procedure. We derive non-asymptotic upper bounds for the quadratic risk and optimization risk of $\boldsymbol{z}_n$, and for the risk of estimating $f^*$. We establish minimax lower bounds showing that, under certain choice of parameters, the proposed algorithms achieve the minimax optimal rates of convergence on the class of smooth and strongly convex functions. Modeling Random Networks with Heterogeneous Reciprocity http://jmlr.org/papers/v25/22-1317.html http://jmlr.org/papers/volume25/22-1317/22-1317.pdf 2024 Daniel Cirkovic, Tiandong Wang Reciprocity, or the tendency of individuals to mirror behavior, is a key measure that describes information exchange in a social network. Users in social networks tend to engage in different levels of reciprocal behavior. Differences in such behavior may indicate the existence of communities that reciprocate links at varying rates. In this paper, we develop methodology to model the diverse reciprocal behavior in growing social networks. In particular, we present a preferential attachment model with heterogeneous reciprocity that imitates the attraction users have for popular users, plus the heterogeneous nature by which they reciprocate links. We compare Bayesian and frequentist model fitting techniques for large networks, as well as computationally efficient variational alternatives. Cases where the number of communities is known and unknown are both considered. We apply the presented methods to the analysis of Facebook and Reddit networks where users have non-uniform reciprocal behavior patterns. The fitted model captures the heavy-tailed nature of the empirical degree distributions in the datasets and identifies multiple groups of users that differ in their tendency to reply to and receive responses to wallposts and comments. Exploration, Exploitation, and Engagement in Multi-Armed Bandits with Abandonment http://jmlr.org/papers/v25/22-1251.html http://jmlr.org/papers/volume25/22-1251/22-1251.pdf 2024 Zixian Yang, Xin Liu, Lei Ying The traditional multi-armed bandit (MAB) model for recommendation systems assumes the user stays in the system for the entire learning horizon. In new online education platforms such as ALEKS or new video recommendation systems such as TikTok, the amount of time a user spends on the app depends on how engaging the recommended contents are. Users may temporarily leave the system if the recommended items cannot engage the users. To understand the exploration, exploitation, and engagement in these systems, we propose a new model, called MAB-A where “A” stands for abandonment and the abandonment probability depends on the current recommended item and the user's past experience (called state). We propose two algorithms, ULCB and KL-ULCB, both of which do more exploration (being optimistic) when the user likes the previous recommended item and less exploration (being pessimistic) when the user does not. We prove that both ULCB and KL-ULCB achieve logarithmic regret, $O(\log K)$, where $K$ is the number of visits (or episodes). Furthermore, the regret bound under KL-ULCB is asymptotically sharp. We also extend the proposed algorithms to the general-state setting. Simulation results show that the proposed algorithms have significantly lower regret than the traditional UCB and KL-UCB, and Q-learning-based algorithms. On Efficient and Scalable Computation of the Nonparametric Maximum Likelihood Estimator in Mixture Models http://jmlr.org/papers/v25/22-1120.html http://jmlr.org/papers/volume25/22-1120/22-1120.pdf 2024 Yangjing Zhang, Ying Cui, Bodhisattva Sen, Kim-Chuan Toh In this paper, we focus on the computation of the nonparametric maximum likelihood estimator (NPMLE) in multivariate mixture models. Our approach discretizes this infinite dimensional convex optimization problem by setting fixed support points for the NPMLE and optimizing over the mixing proportions. We propose an efficient and scalable semismooth Newton based augmented Lagrangian method (ALM). Our algorithm outperforms the state-of-the-art methods (Kim et al., 2020; Koenker and Gu, 2017), capable of handling $n \approx 10^6$ data points with $m \approx 10^4$ support points. A key advantage of our approach is its strategic utilization of the solution's sparsity, leading to structured sparsity in Hessian computations. As a result, our algorithm demonstrates better scaling in terms of $m$ when compared to the mixsqp method (Kim et al., 2020). The computed NPMLE can be directly applied to denoising the observations in the framework of empirical Bayes. We propose new denoising estimands in this context along with their consistent estimates. Extensive numerical experiments are conducted to illustrate the efficiency of our ALM. In particular, we employ our method to analyze two astronomy data sets: (i) Gaia-TGAS Catalog (Anderson et al., 2018) containing approximately $1.4 \times 10^6$ data points in two dimensions, and (ii) a data set from the APOGEE survey (Majewski et al., 2017) with approximately $2.7 \times 10^4$ data points. Decorrelated Variable Importance http://jmlr.org/papers/v25/22-0801.html http://jmlr.org/papers/volume25/22-0801/22-0801.pdf 2024 Isabella Verdinelli, Larry Wasserman Because of the widespread use of black box prediction methods such as random forests and neural nets, there is renewed interest in developing methods for quantifying variable importance as part of the broader goal of interpretable prediction. A popular approach is to define a variable importance parameter --- known as LOCO (Leave Out COvariates) --- based on dropping covariates from a regression model. This is essentially a nonparametric version of $R^2$. This parameter is very general and can be estimated nonparametrically, but it can be hard to interpret because it is affected by correlation between covariates. We propose a method for mitigating the effect of correlation by defining a modified version of LOCO. This new parameter is difficult to estimate nonparametrically, but we show how to estimate it using semiparametric models. Model-Free Representation Learning and Exploration in Low-Rank MDPs http://jmlr.org/papers/v25/22-0687.html http://jmlr.org/papers/volume25/22-0687/22-0687.pdf 2024 Aditya Modi, Jinglin Chen, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal The low-rank MDP has emerged as an important model for studying representation learning and exploration in reinforcement learning. With a known representation, several model-free exploration strategies exist. In contrast, all algorithms for the unknown representation setting are model-based, thereby requiring the ability to model the full dynamics. In this work, we present the first model-free representation learning algorithms for low-rank MDPs. The key algorithmic contribution is a new minimax representation learning objective, for which we provide variants with differing tradeoffs in their statistical and computational properties. We interleave this representation learning step with an exploration strategy to cover the state space in a reward-free manner. The resulting algorithms are provably sample efficient and can accommodate general function approximation to scale to complex environments. Seeded Graph Matching for the Correlated Gaussian Wigner Model via the Projected Power Method http://jmlr.org/papers/v25/22-0402.html http://jmlr.org/papers/volume25/22-0402/22-0402.pdf 2024 Ernesto Araya, Guillaume Braun, Hemant Tyagi In the graph matching problem we observe two graphs $G,H$ and the goal is to find an assignment (or matching) between their vertices such that some measure of edge agreement is maximized. We assume in this work that the observed pair $G,H$ has been drawn from the Correlated Gaussian Wigner (CGW) model -- a popular model for correlated weighted graphs -- where the entries of the adjacency matrices of $G$ and $H$ are independent Gaussians and each edge of $G$ is correlated with one edge of $H$ (determined by the unknown matching) with the edge correlation described by a parameter $\sigma \in [0,1)$. In this paper, we analyse the performance of the projected power method (PPM) as a seeded graph matching algorithm where we are given an initial partially correct matching (called the seed) as side information. We prove that if the seed is close enough to the ground-truth matching, then with high probability, PPM iteratively improves the seed and recovers the ground-truth matching (either partially or exactly) in $O(\log n)$ iterations. Our results prove that PPM works even in regimes of constant $\sigma$, thus extending the analysis in (Mao et al., 2023) for the sparse Correlated Erdos-Renyi (CER) model to the (dense) CGW model. As a byproduct of our analysis, we see that the PPM framework generalizes some of the state-of-art algorithms for seeded graph matching. We support and complement our theoretical findings with numerical experiments on synthetic data. Fast Policy Extragradient Methods for Competitive Games with Entropy Regularization http://jmlr.org/papers/v25/21-1205.html http://jmlr.org/papers/volume25/21-1205/21-1205.pdf 2024 Shicong Cen, Yuting Wei, Yuejie Chi This paper investigates the problem of computing the equilibrium of competitive games in the form of two-player zero-sum games, which is often modeled as a constrained saddle-point optimization problem with probability simplex constraints. Despite recent efforts in understanding the last-iterate convergence of extragradient methods in the unconstrained setting, the theoretical underpinnings of these methods in the constrained settings, especially those using multiplicative updates, remain highly inadequate, even when the objective function is bilinear. Motivated by the algorithmic role of entropy regularization in single-agent reinforcement learning and game theory, we develop provably efficient extragradient methods to find the quantal response equilibrium (QRE)---which are solutions to zero-sum two-player matrix games with entropy regularization---at a linear rate. The proposed algorithms can be implemented in a decentralized manner, where each player executes symmetric and multiplicative updates iteratively using its own payoff without observing the opponent's actions directly. In addition, by controlling the knob of entropy regularization, the proposed algorithms can locate an approximate Nash equilibrium of the unregularized matrix game at a sublinear rate without assuming the Nash equilibrium to be unique. Our methods also lead to efficient policy extragradient algorithms for solving (entropy-regularized) zero-sum Markov games at similar rates. All of our convergence rates are nearly dimension-free, which are independent of the size of the state and action spaces up to logarithm factors, highlighting the positive role of entropy regularization for accelerating convergence. Power of knockoff: The impact of ranking algorithm, augmented design, and symmetric statistic http://jmlr.org/papers/v25/21-1137.html http://jmlr.org/papers/volume25/21-1137/21-1137.pdf 2024 Zheng Tracy Ke, Jun S. Liu, Yucong Ma The knockoff filter is a recent false discovery rate (FDR) control method for high-dimensional linear models. We point out that knockoff has three key components: ranking algorithm, augmented design, and symmetric statistic, and each component admits multiple choices. By considering various combinations of the three components, we obtain a collection of variants of knockoff. All these variants guarantee finite-sample FDR control, and our goal is to compare their power. We assume a Rare and Weak signal model on regression coeffi- cients and compare the power of different variants of knockoff by deriving explicit formulas of false positive rate and false negative rate. Our results provide new insights on how to improve power when controlling FDR at a targeted level. We also compare the power of knockoff with its propotype - a method that uses the same ranking algorithm but has access to an ideal threshold. The comparison reveals the additional price one pays by finding a data-driven threshold to control FDR. Lower Complexity Bounds of Finite-Sum Optimization Problems: The Results and Construction http://jmlr.org/papers/v25/21-0264.html http://jmlr.org/papers/volume25/21-0264/21-0264.pdf 2024 Yuze Han, Guangzeng Xie, Zhihua Zhang In this paper we study the lower complexity bounds for finite-sum optimization problems, where the objective is the average of $n$ individual component functions. We consider a so-called proximal incremental first-order oracle (PIFO) algorithm, which employs the individual component function's gradient and proximal information provided by PIFO to update the variable. To incorporate loopless methods, we also allow the PIFO algorithm to obtain the full gradient infrequently. We develop a novel approach to constructing the hard instances, which partitions the tridiagonal matrix of classical examples into $n$ groups. This construction is friendly to the analysis of PIFO algorithms. Based on this construction, we establish the lower complexity bounds for finite-sum minimax optimization problems when the objective is convex-concave or nonconvex-strongly-concave and the class of component functions is $L$-average smooth. Most of these bounds are nearly matched by existing upper bounds up to log factors. We also derive similar lower bounds for finite-sum minimization problems as previous work under both smoothness and average smoothness assumptions. Our lower bounds imply that proximal oracles for smooth functions are not much more powerful than gradient oracles. On Truthing Issues in Supervised Classification http://jmlr.org/papers/v25/19-301.html http://jmlr.org/papers/volume25/19-301/19-301.pdf 2024 Jonathan K. Su Ideal supervised classification assumes known correct labels, but various truthing issues can arise in practice: noisy labels; multiple, conflicting labels for a sample; missing labels; and different labeler combinations for different samples. Previous work introduced a noisy-label model, which views the observed noisy labels as random variables conditioned on the unobserved correct labels. It has mainly focused on estimating the conditional distribution of the noisy labels and the class prior, as well as estimating the correct labels or training with noisy labels. In a complementary manner, given the conditional distribution and class prior, we apply estimation theory to classifier testing, training, and comparison of different combinations of labelers. First, for binary classification, we construct a testing model and derive approximate marginal posteriors for accuracy, precision, recall, probability of false alarm, and F-score, and joint posteriors for ROC and precision-recall analysis. We propose minimum mean-square error (MMSE) testing, which employs empirical Bayes algorithms to estimate the testing-model parameters and then computes optimal point estimates and credible regions for the metrics. We extend the approach to multi-class classification to obtain optimal estimates of accuracy and individual confusion-matrix elements. Second, we present a unified view of training that covers probabilistic (i.e., discriminative or generative) and non-probabilistic models. For the former, we adjust maximum-likelihood or maximum a posteriori training for truthing issues; for the latter, we propose MMSE training, which minimizes the MMSE estimate of the empirical risk. We also describe suboptimal training that is compatible with existing infrastructure. Third, we observe that mutual information lets one express any labeler combination as an equivalent single labeler, implying that multiple mediocre labelers can be as informative as, or more informative than, a single expert labeler. Experiments demonstrate the effectiveness of the methods and confirm the implication.