Fall 2018

Non-convex optimization and deep learning


Since its introduction in the early twentieth century, maximum likelihood estimation has been the predominant guiding principle in the design of effective methods in statistics and machine learning. Combined with regularization, this principle has led to significant advances in several fundamental problems such as high-dimensional linear regression, learning graphical models, matrix completion, and various questions in multivariate analysis. This success was chiefly due to the development of convex relaxations, which, in turn, enabled the deployment of the powerful machinery from convex programming. Nevertheless, the modeling task has been largely dissociated from algorithmic design and convex relaxations have been the preponderant link between the two tasks. As statisticians and data scientists have started to process larger amounts of data that call for more sophisticated models, this approach become less viable recently. A prime example resides in latent variable models that are often very powerful from statistical point of view as they are able to capture heterogeneous and subtle trends in data. This power comes, however, at a price of making the corresponding maximum likelihood estimation tasks be large-scale and highly non-convex, putting it beyond the grasp of our current algorithmic understanding. This forced the researchers to resort to heuristics, such as expectation maximization, that are not always reliable and whose guarantees are still poorly understood apart from specific examples where local convexity has played a key role. The goal of this theme is to explore and rethink the interface of statistics and optimization, broadly construed. On one hand, it aims to examine and unify the current landscape of rigorous algorithmic approaches to classic maximum likelihood estimations tasks, such as matrix completion, learning mixtures of Gaussians, and learning graphical models. On the other hand, it will examine a new way of thinking about statistical modeling at large. That is, modeling in which the choice of the model is driven both by the statistical properties of that model and the computational nature of the corresponding maximum likelihood estimation task. The fundamental relationship to investigate here would be: “how does the choice of the latent model affect the structure and complexity of the resulting optimization problem”? A key topic to study in this context will be deep neural network models and its algorithmic workhorse: back-propagation and stochastic gradient descent. These models are currently playing a dominant role in many machine learning scenarios, especially, in computer vision, translation, and robotics. Despite a spectacular success of these methods in practice, both our statistical and algorithmic understanding remains very recent and confined to toy examples. Deep neural networks also constitute an extremely rich design space for statistical models and regularization techniques as well as for algorithms for solving the corresponding training tasks. These tasks can be viewed as highly non-convex and large-scale maximum likelihood estimation problems.