# MIFODS Seminar

MIFODS runs a joint seminar series with other seminars at MIT: the Stochastics and Statistics seminar, the LIDS seminar, the ToC colloquium, the Machine Learning seminar, the Applied Math colloquium, and the Combinatorics seminar.

## Seminars

All seminars are listed in chronological order.

## Afonso Bandeira (NYU)

#### Statistical estimation under group actions: The Sample Complexity of Multi-Reference Alignment

Many problems in signal/image processing and computer vision amount to estimating a signal, image, or three-dimensional structure/scene from corrupted measurements. A particularly challenging form of measurement corruption is a latent transformation of the underlying signal to be recovered. Many such transformations can be described as a group acting on the object to be recovered. Examples include the Simultaneous Localization and Mapping (SLAM) problem in robotics and computer vision, where pictures of a scene are obtained from different positions and orientations; Cryo-Electron Microscopy (Cryo-EM) imaging, where projections of a molecule density are taken from unknown rotations; and several others. One fundamental example of this type of problem is Multi-Reference Alignment: given a group acting on a space, the goal is to estimate an orbit of the group action from noisy samples. For example, in one of its simplest forms, one is tasked with estimating a signal from noisy cyclically shifted copies. We will show that the number of observations needed by any method has a surprising dependency on the signal-to-noise ratio (SNR) and on algebraic properties of the underlying group action. Remarkably, in some important cases, this sample complexity is achieved with computationally efficient methods based on computing invariants under the group of transformations.
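As a toy illustration of the cyclic-shift case (an illustrative sketch, not the talk's algorithms): the power spectrum is invariant under cyclic shifts, so averaging it over many noisy observations estimates an invariant of the orbit without aligning anything. The signal and parameters below are made-up assumptions.

```python
import numpy as np

def observe(x, sigma, rng):
    """One MRA observation: a random cyclic shift of x plus Gaussian noise."""
    return np.roll(x, rng.integers(len(x))) + sigma * rng.normal(size=len(x))

def avg_power_spectrum(x, sigma, n_obs, rng):
    """Average |FFT|^2 over observations. The power spectrum is invariant to
    cyclic shifts, so no alignment is needed; the noise adds a known constant
    bias of sigma^2 * len(x) per frequency, which can be subtracted."""
    ps = np.zeros(len(x))
    for _ in range(n_obs):
        ps += np.abs(np.fft.fft(observe(x, sigma, rng))) ** 2
    return ps / n_obs

x = np.array([1.0, 3.0, 0.0, -2.0, 1.0, 0.0, 0.5, -1.0])
true_ps = np.abs(np.fft.fft(x)) ** 2  # shift-invariant orbit statistic
```

The power spectrum alone does not determine the orbit (phases are lost); invariant-based methods for this model also use higher-order invariants such as the bispectrum.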

## Justin Solomon (MIT)

#### Scaling and Broadening the Scope of Computational Transport

The theory of optimal transport links probability to geometry, providing a natural means to characterize, compare, and manipulate probability distributions supported on a geometric domain. While optimal transport has long been identified as a candidate for improving pipelines in computer vision, graphics, statistics, and other disciplines, computational considerations have made it challenging to apply in practice; recent improvements in the computational pipeline, however, have motivated renewed interest in this discipline among members of the computer science community. In this talk, I will discuss recent efforts jointly with collaborators and students in the MIT Geometric Data Processing Group to develop efficient algorithms for computational optimal transport and derived problems, leveraging structure in key applications of this machinery. I will highlight new techniques for stochastic transport, transport along discrete surfaces, and transport along graphs, as well as extensions of the basic problem to tensor-valued measures and to quadratic assignment problems.
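One of the computational improvements that drove this renewed interest is entropic regularization, which reduces optimal transport to simple matrix scaling (Sinkhorn iterations). The sketch below is a generic textbook version under illustrative data, not the speaker's algorithms.

```python
import numpy as np

def sinkhorn(mu, nu, C, eps=0.05, n_iter=500):
    """Entropic-regularized OT between histograms mu, nu with cost matrix C.
    Alternately rescales rows and columns of the Gibbs kernel K = exp(-C/eps)
    until the coupling's marginals match mu and nu."""
    K = np.exp(-C / eps)
    u = np.ones_like(mu)
    for _ in range(n_iter):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]  # approximate optimal coupling

# two histograms on a line, squared-distance cost
x = np.linspace(0, 1, 6)
mu = np.array([0.5, 0.5, 0.0, 0.0, 0.0, 0.0])
nu = np.array([0.0, 0.0, 0.0, 0.0, 0.5, 0.5])
C = (x[:, None] - x[None, :]) ** 2
P = sinkhorn(mu, nu, C)
cost = (P * C).sum()  # approximate transport cost
```

Smaller `eps` gives sharper couplings at the price of slower, less stable iterations; log-domain implementations are used in practice for small `eps`.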

## David Sontag (MIT)

#### When Inference is Tractable

A key capability of artificial intelligence will be the ability to reason about abstract concepts and draw inferences. Where data is limited, probabilistic inference in graphical models provides a powerful framework for performing such reasoning, and can even be used as modules within deep architectures. But, when is probabilistic inference computationally tractable? I will present recent theoretical results that substantially broaden the class of provably tractable models by exploiting model stability (Lang, Sontag, Vijayaraghavan, AI Stats ’18), structure in model parameters (Weller, Rowland, Sontag, AI Stats ’16), and reinterpreting inference as ground truth recovery (Globerson, Roughgarden, Sontag, Yildirim, ICML ’15).

#### Is Depth Needed for Deep Learning?

Deep learning, as its name indicates, is based on training artificial neural networks with many layers. A key theoretical question is to understand why such depth is beneficial, and when it is provably necessary for expressing certain types of functions. In fact, this question is closely related to circuit complexity, which has long been studied in theoretical computer science -- albeit for different reasons, and for circuits which differ in some important ways from modern neural networks. Despite this similarity, the interaction between the circuit complexity and machine learning research communities is currently quite limited. In this talk, I'll survey some of the recent depth separation results developed in the machine learning community, and discuss open questions where insights from circuit complexity might help. The talk is aimed at a general theoretical computer science audience, and no prior knowledge about deep learning will be assumed.

## Johannes Schmidt-Hieber (Leiden)

#### Statistical theory for deep neural networks with ReLU activation function

The universal approximation theorem states that neural networks are capable of approximating any continuous function up to a small error that depends on the size of the network. The expressive power of a network does not, however, guarantee that deep networks perform well on data. For that, control of the statistical estimation risk is needed. In this talk, we derive statistical theory for fitting deep neural networks to data generated from the multivariate nonparametric regression model. It is shown that estimators based on sparsely connected deep neural networks with ReLU activation function and properly chosen network architecture achieve the minimax rates of convergence (up to logarithmic factors) under a general composition assumption on the regression function. The framework includes many well-studied structural constraints such as (generalized) additive models. While there is a lot of flexibility in the network architecture, the tuning parameter is the sparsity of the network. Specifically, we consider large networks whose number of potential parameters is much larger than the sample size. Interestingly, the depth (number of layers) of the neural network architecture plays an important role, and our theory suggests that scaling the network depth with the logarithm of the sample size is natural.

## Sanjoy Dasgupta (UCSD)

#### Using interaction for simpler and better learning

In the usual setup of supervised learning, the learner is given a stack of labeled examples and told to fit a classifier to them. It would be quite unnatural for a human to learn in this way, and indeed this model is known to suffer from a variety of fundamental hardness barriers. However, many of these hurdles can be overcome by moving to a setup in which the learner interacts with a human (or other information source) during the learning process. We will see how interaction makes it possible to: 1. Learn DNF (disjunctive normal form) concepts. 2. Perform machine teaching in situations where the student’s concept class is unknown. 3. Improve the results of unsupervised learning. We will present a generic approach to “interactive structure learning” that, for instance, yields simple interactive algorithms for topic modeling and hierarchical clustering. Along the way, we will present a novel cost function for hierarchical clustering, as well as an efficient algorithm for approximately minimizing this cost.

## Carlos Fernandez-Granda (NYU)

#### Sparse Recovery Beyond Compressed Sensing

Recovering sparse signals from underdetermined linear measurements is a challenging problem. Deconvolution in reflection seismology and imaging, source localization in EEG, estimation of relaxation parameters in MRI, and direction-of-arrival estimation in radar can all be reformulated as sparse inverse problems. Convex-programming methods based on l1-norm minimization are widely applied to tackle such problems in practice, but current theoretical guarantees are mostly focused on randomized measurement operators that are not relevant to these applications. In this talk, we present a theoretical framework to analyze these methods for realistic deterministic operators, which yields exact-recovery guarantees under certain conditions on the signal support that are related to the correlation structure of the linear operator.
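A minimal sketch of l1-norm minimization via iterative soft-thresholding (ISTA), a standard convex-programming approach to such sparse inverse problems. The random Gaussian operator and parameters below are illustrative assumptions, not the deterministic operators analyzed in the talk.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the l1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(A, y, lam, n_iter=10000):
    """Iterative soft-thresholding for min_x 0.5*||Ax - y||^2 + lam*||x||_1."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1/L, L = Lipschitz const. of grad
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - step * A.T @ (A @ x - y), step * lam)
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 100)) / np.sqrt(50)     # underdetermined operator
x0 = np.zeros(100)
x0[[5, 30, 70, 90]] = [1.0, -1.0, 2.0, 1.5]      # sparse ground truth
y = A @ x0                                       # noiseless measurements
x_hat = ista(A, y, lam=0.01)
```

With 50 Gaussian measurements of a 4-sparse signal, the l1 minimizer coincides with the sparse signal with high probability, so `x_hat` lands close to `x0` up to a small bias of order `lam`.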

## Shai Shalev-Shwartz (MobileEye)

#### Autonomous driving: From a science project to mass production

In recent years, car makers and tech companies have been racing towards self-driving cars. It seems that the main parameter in this race is who will have the first car on the road. The goal of the talk is to add two additional crucial parameters to the equation. The first is standardization of safety assurance --- what are the minimal requirements that every self-driving car must satisfy, and how can we verify these requirements? The second parameter is scalability --- engineering solutions with runaway costs will not scale to millions of cars, which will push interest in this field into a niche academic corner and drive the entire field into a "winter of autonomous driving". In the first part of the talk, I will show why statistical guarantees provide only weak safety, and will propose instead a white-box, interpretable, mathematical model for safety assurance, which we call Responsibility-Sensitive Safety (RSS). Time permitting, the second part of the talk will cover reinforcement learning techniques for scalable driving policies.

## Dejan Slepcev (CMU)

#### Variational problems on random structures and their continuum limits

We will discuss variational problems arising in machine learning and their limits as the number of data points goes to infinity. Consider point clouds obtained as random samples of an underlying "ground-truth" measure. A graph representing the point cloud is obtained by assigning weights to edges based on the distance between the points. Many machine learning tasks, such as clustering and semi-supervised learning, can be posed as minimizing functionals on such graphs. We consider functionals involving graph cuts and graph Laplacians, and discuss their limits as the number of data points goes to infinity. In particular, we establish under what conditions the minimizers of the discrete problems have a well-defined continuum limit.
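To make the discrete objects concrete, here is a hedged sketch (illustrative, not the talk's constructions) of a weighted graph built from a point cloud, its unnormalized Laplacian, and a graph-cut functional; the clusters and kernel bandwidth are made-up assumptions.

```python
import numpy as np

def gaussian_graph(points, eps):
    """Weighted graph on a point cloud: w_ij = exp(-|x_i - x_j|^2 / eps^2)."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / eps ** 2)
    np.fill_diagonal(W, 0.0)
    return W

def graph_laplacian(W):
    """Unnormalized graph Laplacian L = D - W."""
    return np.diag(W.sum(1)) - W

def graph_cut(W, labels):
    """Total weight of edges crossing a binary partition."""
    crossing = labels[:, None] != labels[None, :]
    return W[crossing].sum() / 2.0

rng = np.random.default_rng(0)
# two well-separated clusters sampled from a "ground-truth" measure
pts = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal([3, 0], 0.1, (20, 2))])
W = gaussian_graph(pts, eps=0.5)
L = graph_laplacian(W)
labels = np.array([0] * 20 + [1] * 20, dtype=float)
```

For binary labels the Laplacian quadratic form equals the cut weight, which is why cut minimization can be relaxed to spectral problems.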

## Moritz Hardt (UC Berkeley)

#### When Recurrent Models Don't Need To Be Recurrent

We show that stable recurrent neural networks are well approximated by feed-forward networks for the purpose of both inference and training by gradient descent. Our result applies to a broad range of non-linear recurrent neural networks under a natural stability condition, which we observe is also necessary. Complementing our theoretical findings, we verify the conclusions of our theory on both real and synthetic tasks. Furthermore, we demonstrate that recurrent models satisfying the stability assumption of our theory can achieve excellent performance on real sequence learning tasks. Joint work with John Miller (UC Berkeley).
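The core phenomenon is visible already in a scalar linear recurrence: stability makes the influence of old inputs decay geometrically, so a "feed-forward" model that only sees the last k inputs approximates the recurrent one. A toy sketch under illustrative parameters, not the paper's construction:

```python
import numpy as np

def stable_rnn(xs, lam):
    """Scalar linear recurrence h_t = lam * h_{t-1} + x_t, stable when |lam| < 1."""
    h = 0.0
    for x in xs:
        h = lam * h + x
    return h

def truncated(xs, lam, k):
    """Feed-forward surrogate: only the last k inputs influence the output."""
    return stable_rnn(xs[-k:], lam)

rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, 200)
lam = 0.8
# stability bounds the truncation error by lam^k * max|x| / (1 - lam)
errs = [abs(stable_rnn(xs, lam) - truncated(xs, lam, k)) for k in (5, 20, 50)]
```

With `lam = 0.8` the error bound `0.8**k * 5` is about 1.6 at k=5 but below 1e-4 at k=50, so a modest window already suffices.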

#### Improving Generative Adversarial Networks using Game Theory and Statistics

Generative Adversarial Networks (aka GANs) are a recently proposed approach for learning samplers of high-dimensional distributions with intricate structure, such as distributions over natural images, given samples from these distributions. They are obtained by setting up a two-player zero-sum game between two neural networks, which learn statistics of a target distribution by adapting their strategies in the game using gradient descent. Despite their intriguing performance in practice, GANs pose great challenges to both optimization and statistics. Their training suffers from oscillations, and they are difficult to scale to high-dimensional settings. We study how game-theoretic and statistical techniques can be brought to bear on these important challenges. We use Game Theory towards improving GAN training, and Statistics towards scaling up the dimensionality of the generated distributions.

## Suresh Venkatasubramanian (Utah)

#### Towards a theory (or theories) of fairness in automated decision-making

With the advent of automated decision making has come a whole new vocabulary for algorithm design. We say that automated decision-making should be "fair" or "unbiased" or "nondiscriminatory". But our algorithms speak only mathematics, and these words need translation (if they can be translated at all). Over the last few years, the research area of fairness, accountability and transparency has made attempts to formalize what it means for an algorithm to be fair. In this talk I'll discuss what has become a conversation between society and belief systems on the one hand, and mathematical formalisms on the other. I'll explain the different formalizations of fairness and how they reflect differing perspectives on how society understands fairness and bias. I'll also describe how the formalizations lead to conclusions about the possibility (or not) of fair decision-making.

## Tselil Schramm (Harvard+MIT)

#### Efficient Algorithms for the Graph Matching Problem in Correlated Random Graphs

The Graph Matching problem is a robust version of the Graph Isomorphism problem: given two not-necessarily-isomorphic graphs, the goal is to find a permutation of the vertices which maximizes the number of common edges. We study a popular average-case variant; we deviate from the common heuristic strategy and give the first quasi-polynomial time algorithm, where previously only sub-exponential time algorithms were known.
Based on joint work with Boaz Barak, Chi-Ning Chou, Zhixian Lei, and Yueqi Sheng.

## John Duchi (Stanford)

#### Locally private estimation, learning, inference, and optimality

In this talk, we investigate statistical learning and estimation under local privacy constraints, where data providers do not trust the collector of the data and so privatize their data before it is even collected. We identify fundamental tradeoffs between statistical utility and privacy in such local models of privacy, providing instance-specific bounds for private estimation and learning problems by developing local minimax risks. In contrast to approaches based on worst-case (minimax) error, which are conservative, this allows us to evaluate the difficulty of individual problem instances and delineate the possibilities for adaptation in private estimation and inference. As part of this, we identify an alternative to the Fisher information for private estimation, giving a more nuanced understanding of the challenges of adaptivity and optimality. We also provide optimal procedures for private inference, highlighting the importance of a more careful development of optimal tradeoffs between estimation and privacy. One consequence of our results is to identify settings where standard local privacy restrictions may be too strong for practice; time permitting, I will then discuss a few new directions that maintain limited amounts of privacy while simultaneously allowing the development of high-performance statistical and learning procedures.
Based on joint work with Feng Ruan.
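The canonical example of a local-privacy mechanism is randomized response, where each data provider perturbs their own bit before release. This sketch is the standard textbook construction with illustrative parameters, not the talk's optimal procedures.

```python
import numpy as np

def randomized_response(bits, eps, rng):
    """eps-locally-private release of binary data: each user reports their true
    bit with probability e^eps / (1 + e^eps), and the flipped bit otherwise."""
    p = np.exp(eps) / (1 + np.exp(eps))
    keep = rng.random(len(bits)) < p
    return np.where(keep, bits, 1 - bits)

def debiased_mean(reports, eps):
    """Unbiased estimate of the true mean: E[report] = (2p-1)*mu + (1-p)."""
    p = np.exp(eps) / (1 + np.exp(eps))
    return (reports.mean() - (1 - p)) / (2 * p - 1)

rng = np.random.default_rng(0)
bits = (rng.random(100_000) < 0.3).astype(int)  # true mean near 0.3
est = debiased_mean(randomized_response(bits, eps=1.0, rng=rng), eps=1.0)
```

The variance blowup of the debiased estimate as `eps` shrinks is exactly the kind of utility/privacy tradeoff the talk's local minimax framework quantifies.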

## Emma Brunskill (Stanford)

#### Towards Better Reinforcement Learning for High Stakes Domains

There is increasing excitement about reinforcement learning-- a subarea of machine learning for enabling an agent to learn to make good decisions. Yet numerous questions and challenges remain for reinforcement learning to help support progress in important high stakes domains like education, consumer marketing and healthcare. One key question is how to leverage the ever expanding logs of sequences of decisions made and their outcomes to identify better decision policies, such as using electronic medical record data to inform better personalized patient treatments. I will discuss some of our work on developing better statistical estimators and optimizers for this problem, from the vantage point of a reinforcement learning researcher. A second key issue is to narrow the gap between RL theory and experiment, to create online reinforcement learning algorithms with strong guarantees on their performance. In this effort, I will briefly describe our recent work on policy certificates for RL algorithms. This work is a step towards providing the types of guarantees and transparency that will enable us to confidently deploy RL in important high stakes scenarios.
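The basic statistical tool for identifying better policies from logged decisions is importance sampling: reweight logged outcomes by how much more (or less) likely the target policy was to take each action. A textbook baseline under an illustrative two-treatment setup, not the talk's improved estimators:

```python
import numpy as np

def is_estimate(rewards, b_probs, pi_probs):
    """Importance-sampling estimate of a target policy's expected reward from
    data logged under a different behavior policy."""
    w = pi_probs / b_probs  # one importance weight per logged decision
    return np.mean(w * rewards)

rng = np.random.default_rng(0)
n = 10_000
# behavior policy picks each of two treatments w.p. 1/2; treatment 0 pays 1
actions = rng.integers(2, size=n)
rewards = (actions == 0).astype(float)
b = np.full(n, 0.5)                     # behavior prob. of the chosen action
pi = np.where(actions == 0, 0.9, 0.1)   # target prob. of the chosen action
v_hat = is_estimate(rewards, b, pi)     # true target value is 0.9
```

When the target policy equals the behavior policy the weights are 1 and the estimator reduces to the empirical mean; the variance of the weights is what better off-policy estimators aim to control.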

## Parikshit Gopalan (VMware)

#### Good visualizations have short sketches

We consider the problem of interactively visualizing a distributed tabular dataset with billions of rows. Our goal is to allow a user to browse such datasets in real time, at any resolution, with just a few mouse clicks (much like navigating in an online map service).
We hypothesize that the sketching model of computation is tailor-made for this setting: interesting visualizations are amenable to efficient sketching protocols, whose communication cost is bounded by the size of the desired output (which is small, since it fits on screen). In this talk, we present Hillview, an open-source tool for interactive visualization of large datasets, built around this premise. We focus on the algorithmic challenges that arise in trying to render common visualizations in the sketching model.
Based on joint work with Mihai Budiu, Udi Weider, Marcos Aguilera, Lalith Suresh and Han Kruiger (all from VMware research).
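One way to see why on-screen visualizations suit the sketching model: a histogram is a mergeable sketch whose size is bounded by the number of buckets that fit on screen, not by the number of rows. A toy sketch of this idea (not Hillview's actual implementation; partitioning and bucketing are illustrative assumptions):

```python
from collections import Counter

def histogram_sketch(rows, bucket):
    """Per-partition sketch: a small histogram, one counter per on-screen bucket."""
    return Counter(bucket(r) for r in rows)

def merge(sketches):
    """Sketches merge by addition, so partitions can be combined in any order."""
    out = Counter()
    for s in sketches:
        out += s
    return out

data = list(range(1000))
parts = [data[0:400], data[400:650], data[650:1000]]  # distributed partitions
bucket = lambda v: v // 100                           # 10 buckets for a bar chart
merged = merge(histogram_sketch(p, bucket) for p in parts)
full = histogram_sketch(data, bucket)
```

The communication cost per partition is the number of buckets, matching the premise that the protocol's cost is bounded by the size of the rendered output.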

## Emmanuel Abbe (EPFL)

#### When Deep Learning Fails at Learning

As the success of deep learning reaches more grounds, one would like to also envision the potential limits of deep learning. This talk gives a first set of results proving that deep learning algorithms fail at learning certain kinds of functions in poly-time, focusing on the classical problem of parities. It further points to information measures that quantify when these failures tend to generalize, and to the tradeoffs between memory, noise and initialization. Joint work with Colin Sandon.

## Nike Sun (MIT)

#### Capacity Lower Bound for the Ising Perceptron

The perceptron is a toy model of a simple neural network that stores a collection of given patterns. Its analysis reduces to a simple problem in high-dimensional geometry, namely, understanding the intersection of the cube (or sphere) with a collection of random half-spaces. Despite the simplicity of this model, its high-dimensional asymptotics are not well understood. I will describe what is known and present recent results.

## Zico Kolter (CMU)

#### Making Deep Learning More Robust, Modular, and Efficient

Deep learning is often seen as the "breakthrough" AI technology of recent years, revolutionizing areas spanning computer vision, natural language processing, and game playing. However, if we seek to deploy such systems in real-world, safety-critical domains, a starker reality emerges. Modern deep learning systems are brittle (sensitive to adversarial manipulation and a general lack of robustness), opaque (difficult to interpret and debug their components), and expensive (often requiring vastly more data than practical in real-world settings).

These failings are sometimes billed as an argument against deep learning as a whole. But in this talk, I will argue instead for new methods that can address these challenges, while preserving the fundamental benefits of deep learning (namely, end-to-end training of composable, differentiable architectures). First, I will discuss our approaches to designing provably robust deep networks using tools from convex relaxations and duality. I also highlight recent work on scaling these methods to much larger domains, including some initial work on provable robustness at ImageNet scale. Second, I will present our work on integrating more complex modules as interpretable layers within deep learning architectures. I show how modules such as optimization solvers, physical simulation, model-based control, and game equilibrium solvers can all be integrated as layers within a deep network, enabling more intuitive architectures that can learn from vastly less data. Last, I will highlight some additional ongoing directions and open questions in both these areas.

## Sam Hopkins (Berkeley)

#### How to Estimate the Mean of a Random Vector in Polynomial Time

We study polynomial time algorithms for estimating the mean of a multivariate random vector under very mild assumptions: we assume only that the random vector X has finite mean and covariance. This allows for X to be heavy-tailed. In this setting, the radius of the confidence intervals achieved by the empirical mean is exponentially larger than in the case that X is Gaussian or sub-Gaussian. That is, the empirical mean is poorly concentrated. We offer the first polynomial time algorithm to estimate the mean of X with sub-Gaussian-size confidence intervals under such mild assumptions. That is, our estimators are exponentially better-concentrated than the empirical mean. Our algorithm is based on a new semidefinite programming relaxation of a high-dimensional median. Previous estimators which assumed only existence of finitely-many moments of X either sacrifice sub-Gaussian performance or are only known to be computable via brute-force search procedures requiring time exponential in the dimension.

Based on https://arxiv.org/abs/1809.07425, to appear in Annals of Statistics.
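For scalar data, the classical median-of-means estimator already shows how to beat the empirical mean under heavy tails; the talk's contribution is a polynomial-time high-dimensional analogue via semidefinite programming. A minimal scalar sketch with an illustrative dataset:

```python
import numpy as np

def median_of_means(xs, k):
    """Split the sample into k blocks, average each block, take the median of
    the block means. The median step suppresses the influence of heavy-tailed
    outliers that drag the plain empirical mean far from the true mean."""
    blocks = np.array_split(xs, k)
    return np.median([b.mean() for b in blocks])

xs = np.array([1., 2., 3., 2., 1., 3., 2., 1000., 2., 1.])  # one gross outlier
mom = median_of_means(xs, k=5)  # block means: 1.5, 2.5, 2.0, 501.0, 1.5
emp = xs.mean()                 # 101.7, ruined by the single outlier
```

The outlier corrupts only one block, so the median of the block means stays at 2.0 while the empirical mean is pulled above 100.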

#### Exponential Line-crossing Inequalities

This talk will present a class of exponential bounds for the probability that a martingale sequence crosses a time-dependent linear threshold. Our key insight is that it is both natural and fruitful to formulate exponential concentration inequalities in this way. We will illustrate this point by presenting a single assumption and a single theorem that together strengthen many tail bounds for martingales, including classical inequalities (1960-80) by Bernstein, Bennett, Hoeffding, and Freedman; contemporary inequalities (1980-2000) by Shorack and Wellner, Pinelis, Blackwell, van de Geer, and de la Pena; and several modern inequalities (post-2000) by Khan, Tropp, Bercu and Touati, Delyon, and others. In each of these cases, we give the strongest and most general statements to date, quantifying the time-uniform concentration of scalar, matrix, and Banach-space-valued martingales, under a variety of nonparametric assumptions in discrete and continuous time. In doing so, we bridge the gap between existing line-crossing inequalities, the sequential probability ratio test, the Cramer-Chernoff method, self-normalized processes, and other parts of the literature. Time permitting, I will briefly discuss applications to sequential covariance matrix estimation, adaptive clinical trials and multi-armed bandits via the notion of “confidence sequences”.

(joint work with Steve Howard, Jas Sekhon and Jon McAuliffe, preprint https://arxiv.org/abs/1808.03204)

#### Towards Transparency in AI: Methods and Challenges

Automated decision making tools are currently used in high stakes scenarios. From natural language processing tools used to automatically determine one’s suitability for a job, to health diagnostic systems trained to determine a patient’s outcome, machine learning models are used to make decisions that can have serious consequences on people’s lives. In spite of the consequential nature of these use cases, vendors of such models are not required to perform specific tests showing the suitability of their models for a given task. Nor are they required to provide documentation describing the characteristics of their models, or disclose the results of algorithmic audits to ensure that certain groups are not unfairly treated. I will show some examples to examine the dire consequences of basing decisions entirely on machine learning based systems, and discuss recent work on auditing and exposing the gender and skin tone bias found in commercial gender classification systems. I will end with the concept of AI datasheets for datasets and model cards for model reporting to standardize information for datasets and pre-trained models, in order to push the field as a whole towards transparency and accountability. Recently, we have seen many powerful entities in academia and industry announcing initiatives related to AI ethics. I will spend some time in this talk discussing how we can learn from the mistakes and evolutions of other disciplines that have performed, or continue to perform, what some call parachute research: research that uses the pain of marginalized communities without centering their voices or benefiting them.

## Yoram Singer (Princeton)

#### Memory-Efficient Adaptive Optimization for Humungous-Scale Learning

Adaptive gradient-based optimizers such as AdaGrad and Adam are among the methods of choice in modern machine learning. These methods maintain second-order statistics of each model parameter, thus doubling the memory footprint of the optimizer. In behemoth-size applications, this memory overhead restricts the size of the model being used as well as the number of examples in a mini-batch. We describe a novel, simple, and flexible adaptive optimization method with sublinear memory cost that retains the benefits of per-parameter adaptivity while allowing for larger models and mini-batches. We give convergence guarantees for our method and demonstrate its effectiveness in training some of the largest deep models used at Google.
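A sketch of the rank-1 factoring idea behind such sublinear-memory adaptive methods (in the spirit of Adafactor; a simplified illustration, not the paper's algorithm): instead of storing an m-by-n matrix of second-moment statistics, store only its row and column sums and reconstruct on the fly.

```python
import numpy as np

def factored_second_moment(row_sums, col_sums):
    """Rank-1 reconstruction of an m x n matrix of squared-gradient statistics
    from its row and column sums: V_hat = r c^T / sum(r). Memory drops from
    m*n numbers to m + n."""
    r, c = row_sums, col_sums
    return np.outer(r, c) / r.sum()

# exact when the squared-gradient matrix is rank one
a = np.array([1.0, 2.0, 4.0])
b = np.array([0.5, 1.0, 2.0, 0.25])
V = np.outer(a, b)                                   # what Adam would store
V_hat = factored_second_moment(V.sum(1), V.sum(0))   # from m + n numbers
```

For general (non-rank-1) statistics the reconstruction is approximate, but it always preserves the row and column sums, which is what makes the per-parameter step sizes behave sensibly.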

## Will Perkins (University of Chicago)

#### Counting and Sampling at Low Temperatures

We consider the problem of efficient sampling from the hard-core and Potts models from statistical physics. On certain families of graphs, phase transitions in the underlying physics model are linked to changes in the performance of some sampling algorithms, including Markov chains. We develop new sampling and counting algorithms that exploit the phase transition phenomenon and work efficiently on lattices (and bipartite expander graphs) at sufficiently low temperatures in the phase coexistence regime. Our algorithms are based on Pirogov-Sinai theory and the cluster expansion, classical tools from statistical physics. Joint work with Tyler Helmuth and Guus Regts.

## Mona Singh (Princeton)

#### Algorithms for deciphering disease networks

Networks of molecular interactions underlie virtually all functions executed within a cell. Networks thus provide a powerful foundation within which to interpret a wide range of rapidly accumulating biological data. In this talk, I will present formulations and algorithms that leverage the structure and function of biological networks in order to gain insights into diseases such as cancer. First, I will introduce a framework that can rapidly integrate multiple sources of information about molecular functionality in order to discover key interactions within a network that tend to be disrupted in cancers. Crucially, our approach is based on analytical calculations that obviate the need to perform time-prohibitive permutation-based significance tests. Second, I will describe algorithms that map prior and newly collected data onto network nodes in order to uncover disease-relevant subnetworks. Finally, I will demonstrate the effectiveness of our methods in the cancer context. Overall, our work showcases the versatility and power of a network viewpoint in advancing biomedical discovery. (No biology background is required for this talk.)

#### Frontiers of Efficient Neural-Network Learnability

What are the most expressive classes of neural networks that can be learned, provably, in polynomial-time in a distribution-free setting? In this talk we give the first efficient algorithm for learning neural networks with two nonlinear layers using tools for solving isotonic regression, a nonconvex (but tractable) optimization problem. If we further assume the distribution is symmetric, we obtain the first efficient algorithm for recovering the parameters of a one-layer convolutional network. These results implicitly make use of a convex surrogate loss for generalized linear models and go beyond the kernel-method/overparameterization regime used in recent works.
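Isotonic regression, the main algorithmic tool mentioned above, can be solved exactly by the classical pool-adjacent-violators algorithm (PAVA). A minimal sketch of PAVA itself, separate from the talk's learning results:

```python
def pava(y):
    """Pool-adjacent-violators: least-squares fit of a nondecreasing sequence.
    Whenever a new value breaks monotonicity, merge it with the preceding
    block and replace both by their weighted mean."""
    vals, wts = [], []  # running block means and block sizes
    for v in y:
        vals.append(float(v))
        wts.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            w = wts[-2] + wts[-1]
            vals[-2] = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / w
            wts[-2] = w
            vals.pop()
            wts.pop()
    return [v for v, w in zip(vals, wts) for _ in range(w)]

fit = pava([3.0, 1.0, 2.0])  # -> [2.0, 2.0, 2.0]
```

PAVA runs in linear time (each element is merged at most once), which is part of why isotonic regression is a tractable nonconvex-looking subproblem.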

#### Does Learning Require Memorization? A Short Tale about a Long Tail

Learning algorithms based on deep neural networks are well-known to (nearly) perfectly fit the training set and fit well even the random labels. The reasons for this tendency to memorize the labels of the training data are not well understood. We provide a simple model for prediction problems in which such memorization is necessary for achieving close-to-optimal generalization error. In our model, data is sampled from a mixture of subpopulations and the frequencies of these subpopulations are chosen from some prior. Our analysis demonstrates that memorization becomes necessary whenever the frequency prior is long-tailed. Image and text data are known to follow such distributions and therefore our results establish a formal link between these empirical phenomena. We complement the theoretical results with experiments on several standard benchmarks showing that memorization is an essential part of deep learning. Based on https://arxiv.org/abs/1906.05271 and an ongoing work with Chiyuan Zhang.

## Kamalika Chaudhuri (UCSD)

#### Challenges in Reliable Machine Learning

As machine learning is increasingly used in real applications, there is a need for reliable and robust methods. In this talk, we will discuss two such challenges that arise in reliable machine learning. The first is sample selection bias, where training data is available from a distribution conditioned on a sample selection policy, but the resultant classifier needs to be evaluated on the entire population. We will show how we can use active learning to get a small amount of labeled data from the entire population that can be used to correct this kind of sample selection bias. The second is robustness to adversarial examples -- slight strategic perturbations of legitimate test inputs that cause misclassification. We next look at adversarial examples in the context of a simple non-parametric classifier -- the k-nearest neighbor classifier, and look at its robustness properties. We provide bounds on its robustness as a function of k, and propose a more robust 1-nearest neighbor classifier. Joint work with Songbai Yan, Tara Javidi, Yaoyuan Yang, Cyrus Rastchian, Yizhen Wang and Somesh Jha.

## David Duvenaud (University of Toronto)

#### Neural Stochastic Differential Equations for Sparsely-sampled Time Series

Much real-world data is sampled at irregular intervals, but most time series models require regularly-sampled data. Continuous-time latent variable models can address this problem, but until now only deterministic models, such as latent ODEs, were efficiently trainable by backprop. We generalize the adjoint sensitivity method to SDEs, constructing an SDE that runs backwards in time and computes all necessary gradients, along with a general algorithm that allows SDEs to be trained by backpropagation with constant memory cost. We also give an efficient algorithm for gradient-based stochastic variational inference in function space, all with the use of adaptive black-box SDE solvers. Finally, we'll show initial results of applying latent SDEs to time series data, and discuss prototypes of infinitely-deep Bayesian neural networks.

## Moses Charikar (Stanford)

#### Approximating Profile Maximum Likelihood Efficiently: New Bounds on the Bethe Permanent

Symmetric properties of distributions arise in multiple settings. For each of these, separate estimators and analysis techniques have been developed. Recently, Orlitsky et al. showed that a single estimator that maximizes profile maximum likelihood (PML) is sample competitive for all symmetric properties. Further, they showed that even a $2^{n^{1-\delta}}$-approximate maximizer of the PML objective can serve as such a universal plug-in estimator. (Here n is the size of the sample). Unfortunately, no polynomial time computable PML estimator with such an approximation guarantee was known. We provide the first such estimator and show how to compute it in time nearly linear in n. The PML objective is related to the permanent of a certain Vandermonde matrix. A surprising connection between the convex relaxation we introduce and the Bethe free energy approximation originating in statistical physics leads to new bounds on the Bethe permanent of non-negative matrices.

## Lenka Zdeborova (Institut de Physique Théorique, CNRS)

#### Understanding machine learning with statistical physics

The affinity between statistical physics and machine learning has a long history; this is reflected even in machine learning terminology, which is in part adopted from physics. Current theoretical challenges and open questions about deep learning and statistical learning call for a unified account of the following three ingredients: (a) the dynamics of the learning algorithm, (b) the architecture of the neural networks, and (c) the structure of the data. Most existing theories do not take all three of these aspects into account in a satisfactory manner. In this talk I will describe some of the results stemming from statistical physics applied to machine learning and how they include the three ingredients, although in a very simplified manner. Then I will focus on current results improving our modelling of each of the three aspects, covering recent articles.

## Sujay Sanghavi (UT Austin)

#### Towards Model Agnostic Robustness

It is now common practice to try to solve machine learning problems by starting with a complex existing model or architecture and fine-tuning/adapting it to the task at hand. However, outliers, errors, or even just sloppiness in training data often lead to drastic drops in performance. We investigate a simple generic approach to correct for this, motivated by a classic statistical idea: trimmed loss. This advocates jointly (a) selecting which training samples to ignore, and (b) fitting a model on the remaining samples. Solved jointly, this is computationally infeasible even for linear regression. We propose and study the natural iterative variant that alternates between these two steps (a) and (b), each of which individually can be easily accomplished in pretty much any statistical setting. We also study the batch-SGD variant of this idea. We demonstrate both theoretically (for generalized linear models) and empirically (for vision and NLP neural network models) that this approach effectively recovers accuracy in the presence of bad training data. This work is joint with Yanyao Shen and Vatsal Shah and appears in NeurIPS 2019 and ICML 2019.
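The alternating scheme is easy to sketch for linear regression: keep the samples with smallest residuals under the current fit, then refit on them. The data, corruption pattern, and parameters below are illustrative assumptions, not the paper's experiments.

```python
import numpy as np

def trimmed_lstsq(X, y, keep, n_rounds=10):
    """Alternating minimization for trimmed least squares: (a) keep the `keep`
    samples with smallest absolute residuals under the current fit,
    (b) refit ordinary least squares on them."""
    w = np.linalg.lstsq(X, y, rcond=None)[0]  # start from the full-data fit
    for _ in range(n_rounds):
        resid = np.abs(X @ w - y)
        idx = np.argsort(resid)[:keep]                      # step (a)
        w = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]  # step (b)
    return w

x = np.arange(20.0)
X = np.column_stack([x, np.ones_like(x)])
y = 2.0 * x                       # true model: slope 2, intercept 0
y[[3, 11, 17]] += 100.0           # three grossly corrupted labels
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]  # pulled far off by the outliers
w_trim = trimmed_lstsq(X, y, keep=17)
```

Here the three corrupted points have by far the largest residuals under the initial fit, so the first selection step already discards them and the refit recovers the clean model exactly.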

## Mark Sellke (Stanford)

#### Chasing Convex Bodies

I will explain our recent understanding of the chasing convex bodies problem posed by Friedman and Linial in 1993. In this problem, a player receives a request sequence $K_1, \dots, K_T$ of convex sets in d-dimensional space and moves online into each requested set. The player's movement cost is the length of the resulting path. Chasing convex bodies asks to find an online algorithm with cost competitive against the offline (in hindsight) optimal path. This is both a challenging metrical task system and (equivalent to) a competitive-analysis view of online convex optimization. This problem was open for d \geq 2 until recently, but has since been solved twice. The first solution gives a $2^{O(d)}$-competitive algorithm, while the second gives a nearly optimal $\min(d, \sqrt{d \log T})$-competitive algorithm for T requests. The latter result is based on the Steiner point, which is the exact optimal solution to a related geometric problem called Lipschitz selection and dates from 1840. I will briefly outline the first solution and fully explain the second. Partially based on joint work with Sébastien Bubeck, Boaz Klartag, Yin Tat Lee, and Yuanzhi Li.
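The Steiner point has a simple Monte Carlo description: it is the average, over uniformly random directions, of the point of the body maximizing that direction. A hedged sketch for a polytope (an illustration of the definition only, not the competitive algorithm):

```python
import numpy as np

def steiner_point_mc(vertices, n_dirs, rng):
    """Monte Carlo estimate of the Steiner point of a polytope: average, over
    random unit directions theta, of the vertex maximizing <theta, x>. The
    exact Steiner point is this average over the whole sphere."""
    d = vertices.shape[1]
    thetas = rng.normal(size=(n_dirs, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    best = vertices[np.argmax(thetas @ vertices.T, axis=1)]
    return best.mean(axis=0)

square = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
rng = np.random.default_rng(0)
s = steiner_point_mc(square, 20_000, rng)  # Steiner point of the square: origin
```

By symmetry the Steiner point of the square is the origin, and translating the body translates the Steiner point with it, which is the Lipschitz-selection property the algorithm exploits.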
