Removing the influence of a group variable in high-dimensional predictive modelling
Predictive modelling relies on the assumption that observations used for training are representative of the data that will be encountered in future samples. In a variety of applications, this assumption is severely violated, since observational training data are often collected under sampling processes which are systematically biased with respect to group membership. Without explicit adjustment, machine learning algorithms can produce predictions that have poor generalization error with performance that varies widely by group. We propose a method to pre-process the training data, producing an adjusted dataset that is independent of the group variable with minimum information loss. We develop a conceptually simple approach for creating such a set of features in high dimensional settings based on a constrained form of principal components analysis. The resulting dataset can then be used in any predictive algorithm with the guarantee that predictions will be independent of the group variable. We develop a scalable algorithm for implementing the method, along with theory support in the form of independence guarantees and optimality. The method is illustrated on some simulation examples and applied to two real examples: removing machine-specific correlations from brain scan data, and removing race and ethnicity information from a dataset used to predict recidivism.
A systematic investigation of classical causal inference strategies under mis-specification due to network interference
We systematically investigate issues due to mis-specification that arise in estimating causal effects when (treatment) interference is informed by a network available pre-intervention, i.e., in situations where the outcome of a unit may depend on the treatment assigned to other units. We develop theory for several forms of interference through the concept of exposure neighborhood, and develop the corresponding semi-parametric representation for potential outcomes as a function of the exposure neighborhood. Using this representation, we extend the definition of two popular classes of causal estimands, marginal and average causal effects, to the case of network interference. We characterize the bias and variance one incurs when combining classical randomization strategies (namely, Bernoulli, Completely Randomized, and Cluster Randomized designs) and estimators (namely, difference-in-means and Horvitz-Thompson) used to estimate average treatment effect and on the total treatment effect, under misspecification due to interference. We illustrate how difference-in-means estimators can have arbitrarily large bias when estimating average causal effects, depending on the form and strength of interference, which is unknown at design stage. Horvitz-Thompson (HT) estimators are unbiased when the correct weights are specified. Here, we derive the HT weights for unbiased estimation of different estimands, and illustrate how they depend on the design, the form of interference, which is unknown at design stage, and the estimand. More importantly, we show that HT estimators are in-admissible for a large class of randomization strategies, in the presence of interference. We develop new model-assisted and model-dependent strategies to improve HT estimators, and we develop new randomization strategies for estimating the average treatment effect and total treatment effect.
BabyAI: First Steps Towards Grounded Language Learning With a Human In the Loop
Allowing humans to interactively train artificial agents to understand language instructions is desirable for both practical and scientific reasons, but given the poor data efficiency of the current learning methods, this goal may require substantial research efforts. Here, we introduce the BabyAI research platform to support investigations towards including humans in the loop for grounded language learning. The BabyAI platform comprises an extensible suite of 19 levels of increasing difficulty. The levels gradually lead the agent towards acquiring a combinatorially rich synthetic language which is a proper subset of English. The platform also provides a heuristic expert agent for the purpose of simulating a human teacher. We report baseline results and estimate the amount of human involvement that would be required to train a neural network-based agent on some of the BabyAI levels. We put forward strong evidence that current deep learning methods are not yet sufficiently sample efficient when it comes to learning a language with compositional properties.
A similarity measure for second order properties of non-stationary functional time series with applications to clustering and testing
Due to the surge of data storage techniques, the need for the development of appropriate techniques to identify patterns and to extract knowledge from the resulting enormous data sets, which can be viewed as collections of dependent functional data, is of increasing interest in many scientific areas. We develop a similarity measure for spectral density operators of a collection of functional time series, which is based on the aggregation of Hilbert-Schmidt differences of the individual time-varying spectral density operators. Under fairly general conditions, the asymptotic properties of the corresponding estimator are derived and asymptotic normality is established. The introduced statistic lends itself naturally to quantify (dis)-similarity between functional time series, which we subsequently exploit in order to build a spectral clustering algorithm. Our algorithm is the first of its kind in the analysis of non-stationary (functional) time series and enables to discover particular patterns by grouping together `similar’ series into clusters, thereby reducing the complexity of the analysis considerably. The algorithm is simple to implement and computationally feasible. As a further application we provide a simple test for the hypothesis that the second order properties of two non-stationary functional time series coincide.
Stochastic Primal-Dual Q-Learning
In this work, we present a new model-free and off-policy reinforcement learning (RL) algorithm, that is capable of finding a near-optimal policy with state-action observations from arbitrary behavior policies. Our algorithm, called the stochastic primal-dual Q-learning (SPD Q-learning), hinges upon a new linear programming formulation and a dual perspective of the standard Q-learning. In contrast to previous primal-dual RL algorithms, the SPD Q-learning includes a Q-function estimation step, thus allowing to recover an approximate policy from the primal solution as well as the dual solution. We prove a first-of-its-kind result that the SPD Q-learning guarantees a certain convergence rate, even when the state-action distribution is time-varying but sub-linearly converges to a stationary distribution. Numerical experiments are provided to demonstrate the off-policy learning abilities of the proposed algorithm in comparison to the standard Q-learning.
Unsupervised Anomalous Data Space Specification
Computer algorithms are written with the intent that when run they perform a useful function. Typically any information obtained is unknown until the algorithm is run. However, if the behavior of an algorithm can be fully described by precomputing just once how this algorithm will respond when executed on any input, this precomputed result provides a complete specification for all solutions in the problem domain. We apply this idea to a previous anomaly detection algorithm, and in doing so transform it from one that merely detects individual anomalies when asked to discover potentially anomalous values, into an algorithm also capable of generating a complete specification for those values it would deem to be anomalous. This specification is derived by examining no more than a small training data, can be obtained in very small constant time, and is inherently far more useful than results obtained by repeated execution of this tool. For example, armed with such a specification one can ask how close an anomaly is to being deemed normal, and can validate this answer not by exhaustively testing the algorithm but by examining if the specification so generated is indeed correct. This powerful idea can be applied to any algorithm whose runtime behavior can be recovered from its construction and so has wide applicability.
Heteroskedastic PCA: Algorithm, Optimality, and Applications
Principal component analysis (PCA) and singular value decomposition (SVD) are widely used in statistics, machine learning, and applied mathematics. It has been well studied in the case of homoskedastic noise, where the noise levels of the contamination are homogeneous. In this paper, we consider PCA and SVD in the presence of heteroskedastic noise, which arises naturally in a range of applications. We introduce a general framework for heteroskedastic PCA and propose an algorithm called HeteroPCA, which involves iteratively imputing the diagonal entries to remove the bias due to heteroskedasticity. This procedure is computationally efficient and provably optimal under the generalized spiked covariance model. A key technical step is a deterministic robust perturbation analysis on the singular subspace, which can be of independent interest. The effectiveness of the proposed algorithm is demonstrated in a suite of applications, including heteroskedastic low-rank matrix denoising, Poisson PCA, and SVD based on heteroskedastic and incomplete data.
Sequenced-Replacement Sampling for Deep Learning
We propose sequenced-replacement sampling (SRS) for training deep neural networks. The basic idea is to assign a fixed sequence index to each sample in the dataset. Once a mini-batch is randomly drawn in each training iteration, we refill the original dataset by successively adding samples according to their sequence index. Thus we carry out replacement sampling but in a batched and sequenced way. In a sense, SRS could be viewed as a way of performing ‘mini-batch augmentation’. It is particularly useful for a task where we have a relatively small images-per-class such as CIFAR-100. Together with a longer period of initial large learning rate, it significantly improves the classification accuracy in CIFAR-100 over the current state-of-the-art results. Our experiments indicate that training deeper networks with SRS is less prone to over-fitting. In the best case, we achieve an error rate as low as 10.10%.
Learning Multi-Layer Transform Models
Learned data models based on sparsity are widely used in signal processing and imaging applications. A variety of methods for learning synthesis dictionaries, sparsifying transforms, etc., have been proposed in recent years, often imposing useful structures or properties on the models. In this work, we focus on sparsifying transform learning, which enjoys a number of advantages. We consider multi-layer or nested extensions of the transform model, and propose efficient learning algorithms. Numerical experiments with image data illustrate the behavior of the multi-layer transform learning algorithm and its usefulness for image denoising. Multi-layer models provide better denoising quality than single layer schemes.
AdaPtive Noisy Data Augmentation (PANDA) for Simultaneous Construction Multiple Graph Models
We extend the data augmentation technique (PANDA) by Li et al. (2018) for regularizing single graph model estimations to jointly learning the structures of multiple graphs. Our proposed approach provides an unified framework to effectively jointly train multiple graphical models, regardless of the types of nodes. We design and introduce two types of noises to augment the observed data. The first type of noises is to regularize the estimation of each graph while the second type of noises promotes either the structural similarities, referred as the joint group lasso (JGL) regularization, or numerical similarities, referred as the joint fused ridge (JFR) regularization, among the edges in the same position across multiple graphs. The computation in PANDA is straightforward and only involves obtaining maximum likelihood estimator in generalized linear models (GLMs) in an iterative manner. We also extend the JGL and JFR regularization beyond the graphical model settings to variable selection and estimation in GLMs. The multiple graph version of PANDA enjoys the theoretical properties established for single graphs including the almost sure (a.s) convergence of the noise-augmented loss function to its expectation and the a.s convergence of the minimizer of the former to the minimizer of the latter. The simulation studies suggest PANDA is non-inferior to existing joint estimation approaches in constructing multiple Gaussian graphical models (GGMs), and significantly improves over the differencing approach over separately estimated graphs in multiple Poisson graphical models (PGMs). We also applied PANDA to a real-life lung cancer microarray data to simultaneously construct four protein networks.
Triadic time series motifs
A Framework for Robot Programming in Cobotic Environments: First user experiments
The increasing presence of robots in industries has not gone unnoticed. Large industrial players have incorporated them into their production lines, but smaller companies hesitate due to high initial costs and the lack of programming expertise. In this work we introduce a framework that combines two disciplines, Programming by Demonstration and Automated Planning, to allow users without any programming knowledge to program a robot. The user teaches the robot atomic actions together with their semantic meaning and represents them in terms of preconditions and effects. Using these atomic actions the robot can generate action sequences autonomously to reach any goal given by the user. We evaluated the usability of our framework in terms of user experiments with a Baxter Research Robot and showed that it is well-adapted to users without any programming experience.
Probabilistic Matrix Factorization with Personalized Differential Privacy
Probabilistic matrix factorization (PMF) plays a crucial role in recommendation systems. It requires a large amount of user data (such as user shopping records and movie ratings) to predict personal preferences, and thereby provides users high-quality recommendation services, which expose the risk of leakage of user privacy. Differential privacy, as a provable privacy protection framework, has been applied widely to recommendation systems. It is common that different individuals have different levels of privacy requirements on items. However, traditional differential privacy can only provide a uniform level of privacy protection for all users. In this paper, we mainly propose a probabilistic matrix factorization recommendation scheme with personalized differential privacy (PDP-PMF). It aims to meet users’ privacy requirements specified at the item-level instead of giving the same level of privacy guarantees for all. We then develop a modified sampling mechanism (with bounded differential privacy) for achieving PDP. We also perform a theoretical analysis of the PDP-PMF scheme and demonstrate the privacy of the PDP-PMF scheme. In addition, we implement the probabilistic matrix factorization schemes both with traditional and with personalized differential privacy (DP-PMF, PDP-PMF) and compare them through a series of experiments. The results show that the PDP-PMF scheme performs well on protecting the privacy of each user and its recommendation quality is much better than the DP-PMF scheme.
Bayesian Distance Clustering
Model-based clustering is widely-used in a variety of application areas. However, fundamental concerns remain about robustness. In particular, results can be sensitive to the choice of kernel representing the within-cluster data density. Leveraging on properties of pairwise differences between data points, we propose a class of Bayesian distance clustering methods, which rely on modeling the likelihood of the pairwise distances in place of the original data. Although some information in the data is discarded, we gain substantial robustness to modeling assumptions. The proposed approach represents an appealing middle ground between distance- and model-based clustering, drawing advantages from each of these canonical approaches. We illustrate dramatic gains in the ability to infer clusters that are not well represented by the usual choices of kernel. A simulation study is included to assess performance relative to competitors, and we apply the approach to clustering of brain genome expression data. Keywords: Distance-based clustering; Mixture model; Model-based clustering; Model misspecification; Pairwise distance matrix; Partial likelihood; Robustness.
Forecasting Time Series with VARMA Recursions on Graphs
Graph-based techniques emerged as a choice to deal with the dimensionality issues in modeling multivariate time series. However, there is yet no complete understanding of how the underlying structure could be exploited to ease this task. This work provides contributions in this direction by considering the forecasting of a process evolving over a graph. We make use of the (approximate) time-vertex stationarity assumption, i.e., timevarying graph signals whose first and second order statistical moments are invariant over time and correlated to a known graph topology. The latter is combined with VAR and VARMA models to tackle the dimensionality issues present in predicting the temporal evolution of multivariate time series. We find out that by projecting the data to the graph spectral domain: (i) the multivariate model estimation reduces to that of fitting a number of uncorrelated univariate ARMA models and (ii) an optimal low-rank data representation can be exploited so as to further reduce the estimation costs. In the case that the multivariate process can be observed at a subset of nodes, the proposed models extend naturally to Kalman filtering on graphs allowing for optimal tracking. Numerical experiments with both synthetic and real data validate the proposed approach and highlight its benefits over state-of-the-art alternatives.
• Statistical Treatment of Inverse Problems Constrained by Differential Equations-Based Models with Stochastic Terms• Power Minimization in Distributed Antenna Systems using Non-Orthogonal Multiple Access and Mutual Successive Interference Cancellation• Prediction of treatment outcome for autism from structure of the brain based on sure independence screening• EdgeSpeechNets: Highly Efficient Deep Neural Networks for Speech Recognition on the Edge• The exponentiated xgammma distribution: Estimation and its application• Recovery of Saturated $γ$ Signal Waveforms by Artificial Neural Networks• New and Simplified Distributed Algorithms for Weighted All Pairs Shortest Paths• Investigating many-body mobility edges in isolated quantum systems• Well, how accurate is it? A Study of Deep Learning Methods for Reynolds-Averaged Navier-Stokes Simulations• Asymptotic Properties for Methods Combining Minimum Hellinger Distance Estimates and Bayesian Nonparametric Density Estimates• Micro-Browsing Models for Search Snippets• MRI Reconstruction via Cascaded Channel-wise Attention Network• On the unimodality of convolutions of sequences of binomial coefficients• Semantic Integration in the Information Flow Framework• Large-scale Hierarchical Alignment for Author Style Transfer• Uniform, nonparametric, non-asymptotic confidence sequences• Fair Cake-Cutting in Practice• Development of a Convex Programming Model for Optimal Power Flow Problems• On Optimal Sensing and Capacity Trade-off in Cognitive Radio Systems with Directional Antennas• Containing all permutations• Global Defensive Alliances in the Lexicographic Product of Paths and Cycles• Merge: An Architecture for Interconnected Testbed Ecosystems• Too Many Hats• Quantile Regression Under Memory Constraint• Dynamic Coupling and Damping Injection for Connectivity-Preserving Swarm Teleoperation• Lower bounds for fluctuations in first-passage percolation for general distributions• FPT algorithms to recognize well covered graphs• Interpolating between Optimal Transport and MMD using Sinkhorn Divergences• A Comparison of the Trojan Y Chromosome Strategy to Harvesting Models for Eradication of Non-Native Species• Exploring Adversarial Examples in Malware Detection• Log-symmetric regression models for correlated errors with an application to mortality data• Performance Improvement in Noisy Linear Consensus Networks with Time-Delay• Deep Learning vs. Human Graders for Classifying Severity Levels of Diabetic Retinopathy in a Real-World Nationwide Screening Program• CURE-OR: Challenging Unreal and Real Environments for Object Recognition• Compositional Verification for Autonomous Systems with Deep Learning Components• Open Vocabulary Learning on Source Code with a Graph-Structured Cache• Reduction of Parameter Redundancy in Biaffine Classifiers with Symmetric and Circulant Weight Matrices• Dyson Brownian Motion for General $β$ and Potential at the Edge• An Efficient Data Retrieval Parallel Reeb Graph Algorithm• Zero-Forcing Per-Group Precoding (ZF-PGP) for Robust Optimized Downlink Massive MIMO Performance• Adaptive Communication Strategies to Achieve the Best Error-Runtime Trade-off in Local-Update SGD• Minor-closed graph classes with bounded layered pathwidth• A Comparative Analysis of Registration Tools: Traditional vs Deep Learning Approach on High Resolution Tissue Cleared Data• Ordered Size Ramsey Number of Paths• Domain-Invariant Projection Learning for Zero-Shot Recognition• Transferrable Feature and Projection Learning with Class Hierarchy for Zero-Shot Learning• Zero and Few Shot Learning with Semantic Feature Synthesis and Competitive Learning• Regime-Switching Jump Diffusions with Non-Lipschitz Coefficients and Countably Many Switching States: Existence and Uniqueness, Feller, and Strong Feller Properties• Multilinear SVD for Millimeter Wave Channel Parameter Estimation• A note on spanning trees of connected $K_{1,t}$-free graphs whose stems have a few leaves• Optimal hedging under fast-varying stochastic volatility• Multi-Domain Pose Network for Multi-Person Pose Estimation and Tracking• Flow Network Models for Online Scheduling Real-time Tasks on Multiprocessors• A Matrix Chernoff Bound for Strongly Rayleigh Distributions and Spectral Sparsifiers from a few Random Spanning Trees• On the lower bounds of Davenport constant• Treatment Effect Models with Strategic Interaction in Treatment Decisions• Exchangeability and Kernel Invariance in Trained MLPs• Super-pixel cloud detection using Hierarchical Fusion CNN• Malicious Web Domain Identification using Online Credibility and Performance Data by Considering the Class Imbalance Issue• Linear Shrinkage Estimation of Covariance Matrices Using Low-Complexity Cross-Validation• Generative Low-Shot Network Expansion• Noise reinforcement for L{é}vy processes• A Two-Layer Distributed Control Method for Islanded Networked Microgrid Systems• On the Poincar{é} constant of log-concave measures• Temporal Action Detection by Joint Identification-Verification• Saliency guided deep network for weakly-supervised image segmentation• Invocation-driven Neural Approximate Computing with a Multiclass-Classifier and Multiple Approximators• A highly scalable and energy-efficient artificial neuron using an Ovonic Threshold Switch (OTS) featuring the spike-frequency adaptation and chaotic activity• QANet: Tensor Decomposition Approach for Query-based Anomaly Detection in Heterogeneous Information Networks• QBDT, a new boosting decision tree method with systematic uncertainties into training for High Energy Physics• Harmonizing Fully Optimal Designs with Classic Randomization in Fixed Trial Experiments• Impact of Corpora Quality on Neural Machine Translation• DGC-Net: Dense Geometric Correspondence Network• STACL: Simultaneous Translation with Integrated Anticipation and Controllable Latency• Approximation of rectifiable $1$-currents and weak-$\ast$ relaxation of the $h$-mass• Towards Efficient Large-Scale Graph Neural Network Computing• Flow Motifs in Interaction Networks• CVABS: Moving Object Segmentation with Common Vector Approach for Videos• Bilu-Linial stability, certified algorithms and the Independent Set problem• On the structure of spikes• Characterizations of indicator functions of fractional factorial designs• Probabilistic forecasting and simulation of electricity prices• Quantum Internet: Networking Challenges in Distributed Quantum Computing• Lévy flights confinement in a parabolic potential and fractional quantum oscillator• ScratchDet:Exploring to Train Single-Shot Object Detectors from Scratch• Fast Graph-Cut Based Optimization for Practical Dense Deformable Registration of Volume Images• Assumption-Based Planning• Convolutional group-sparse coding and source localization• Efficient Dependency-Guided Named Entity Recognition• Learning with privileged information via adversarial discriminative modality distillation• Coordinated exploration for labyrinthine environments with application to the Pursuit-Evasion problem• NOMA Schemes for Multibeam Satellite Communications• High Resolution Semantic Change Detection• Planification par fusions incrémentales de graphes• Fully Convolutional Siamese Networks for Change Detection• Urban Change Detection for Multispectral Earth Observation Using Convolutional Neural Networks• On the optimality of likelihood ratio test for prospect theory based binary hypothesis testing• From Louvain to Leiden: guaranteeing well-connected communities• Families of Markov chains with compatible symmetric-group actions• Improving Fast Segmentation With Teacher-student Learning• Data analysis from empirical moments and the Christoffel function• On the complexity of color-avoiding site and bond percolation• On El Karoui’s General Theory of Optimal Stopping• Beta-rhythm oscillations and synchronization transition in network models of Izhikevich neurons: effect of topology and synaptic type• On a Stochastic Representation Theorem for Meyer-measurable Processes and its Applications in Stochastic Optimal Control and Optimal Stopping• Modelling information flow in stochastic optimal control: How Meyer-$σ$-fields settle the clash between exogenous and endogenous jumps• Complex Hadamard matrices attached to a 3-class nonsymmetric association scheme• Data-driven Analysis of Complex Networks and their Model-generated Counterparts• Hybrid deep neural networks for all-cause Mortality Prediction from LDCT Images• Alphabet-Dependent Bounds for Linear Locally Repairable Codes Based on Residual Codes• Transfer Learning versus Multi-agent Learning regarding Distributed Decision-Making in Highway Traffic• Normality of Necessary Optimality Conditions for Calculus of Variations Problems with State Constraints• An indirect numerical method for a time-optimal state-constrained control problem in a steady two-dimensional fluid flow• MsCGAN: Multi-scale Conditional Generative Adversarial Networks for Person Image Generation• Fairness for Whom? Critically reframing fairness with Nash Welfare Product• Nonlinear integro-differential operator regression with neural networks• Federated Learning in Distributed Medical Databases: Meta-Analysis of Large-Scale Subcortical Brain Data• HeartBEAT: Heart Beat Estimation through Adaptive Tracking• Global minima for optimal control of the obstacle problem• Long time behavior of a mean-field model of interacting neurons• Nonparametric Bayesian Lomax delegate racing for survival analysis with competing risks• Deep Person Re-identification for Probabilistic Data Association in Multiple Pedestrian Tracking• Weak Semi-Markov CRFs for NP Chunking in Informal Text• Finite Volume Simulation Framework for Die Casting with Uncertainty Quantification• Security-Reliability Tradeoff for Distributed Antenna Systems in Heterogeneous Cellular Networks• Supervising strong learners by amplifying weak experts• Conceptual Organization is Revealed by Consumer Activity Patterns• Leveraging Product as an Activation Function in Deep Networks• Learning to Recognize Discontiguous Entities• Memory formation in matter• Remote sensing to reduce the effects of spatial autocorrelation on design-based inference for forest inventory using systematic samples• A Modern Take on the Bias-Variance Tradeoff in Neural Networks• Transition probabilities for infinite two-sided loop-erased random walks• False Discovery and Its Control in Low Rank Estimation• Template-Based Image Reconstruction from Sparse Tomographic Data• Detecting cities in aerial night-time images by learning structural invariants using single reference augmentation• Mainumby: un Ayudante para la Traducción Castellano-Guaraní
Like this:
Like Loading…
Related