To receive talk announcements by email, sign up for our mailing list. In return, please forward announcements of ML-related talks to announce (at) ml.jhu.edu.

Tue 03/06/18, 10:45am, Hackerman Hall B17 **Testing and Repairing Machine Learning Systems in Adversarial Environment** *Yinzhi Cao, Lehigh University*

**Abstract:** Machine learning (ML) systems are increasingly deployed in safety- and security-critical domains such as self-driving cars and malware detection, where system correctness on corner-case inputs is crucial. Existing testing of ML system correctness depends heavily on manually labeled data and therefore often fails to expose erroneous behaviors for rare inputs. In this talk, I will present the first framework to test and repair ML systems, especially in an adversarial environment. In the first part, I will introduce DeepXplore, a whitebox testing framework for real-world deep learning (DL) systems. Our evaluation shows that DeepXplore can successfully find thousands of erroneous corner-case behaviors, e.g., self-driving cars crashing into guard rails and malware masquerading as benign software. In the second part, I will introduce machine unlearning, a general, efficient approach to repair an ML system exhibiting erroneous behaviors. Our evaluation, on four diverse learning systems and real-world workloads, shows that machine unlearning is general, effective, fast, and easy to use.

**Bio:** Yinzhi Cao is an assistant professor at Lehigh University. He earned his Ph.D. in Computer Science at Northwestern University and worked at Columbia University as a postdoc. Before that, he obtained his B.E. degree in Electronics Engineering at Tsinghua University in China. His research mainly focuses on the security and privacy of the Web, smartphones, and machine learning. He has published many papers at security and systems conferences such as IEEE S&P (Oakland), NDSS, CCS, and SOSP. His JShield system has been adopted by Huawei, the world’s largest telecommunication company. His past work has been featured by over 30 media outlets, such as NSF Science Now (Episode 38), CCTV News, IEEE Spectrum, Yahoo! News and ScienceDaily. He received best paper awards at SOSP’17 and IEEE CNS’15. He is one of the recipients of the 2017 Amazon Research Awards (ARA).

(an ML talk in the CS Speaker Series)

Mon 03/05/18, 12:15pm, BSPH, Room W2008 **A Mixed Modeling Framework for Analyzing Multitask Whole-Brain Network Data** *Sean Simpson, Wake Forest School of Medicine*

**Abstract:** The emerging area of brain network analysis considers the brain as a system, providing profound clinical insight into links between system-level properties and behavioral and health outcomes. Network science has facilitated these analyses and our understanding of how the brain is structurally and functionally organized. While network science has catalyzed a paradigmatic shift in neuroscience, methods for statistically modeling and comparing groups of networks have lagged behind. To address this knowledge gap for cross-sectional network data, we developed a mixed modeling framework that enables quantifying the relationship between phenotype and connectivity patterns in the brain, predicting connectivity structure based on phenotype, simulating networks to gain a better understanding of normal ranges of topological variability, and thresholding individual networks leveraging group information. Here we extend this comprehensive approach to enable studying system-level brain properties across multiple tasks. We focus on rest-to-task network changes, but this extension is equally applicable to the assessment of network changes for any repeated task paradigm, including interrelated task designs such as those employed in multisensory studies. Our approach allows: 1) assessing the relationships between population state changes and health outcomes; 2) assessing the relationships between individual variability in state changes and health outcomes; and 3) deriving more accurate and precise estimates of the relationships between phenotype and health outcomes within a given task state by leveraging information from other states.

(an ML talk in the Biostatistics Speaker Series)

Mon 03/05/18, 10:00am, Clark 110 **A Picture of the Energy Landscape of Deep Neural Networks** *Pratik Chaudhari, UCLA*

**Abstract:** Deep networks are mysterious. These over-parametrized machine learning models, trained with rudimentary optimization algorithms on non-convex landscapes in millions of dimensions, have defied attempts to put a sound theoretical footing beneath their impressive performance. This talk will shed light upon some of these mysteries. I will employ diverse ideas (from thermodynamics and optimal transportation to partial differential equations, control theory and Bayesian inference) and paint a picture of the training process of deep networks. Along the way, I will develop state-of-the-art algorithms for non-convex optimization. The goal of machine perception is not just to classify objects in images but to enable intelligent agents that can seamlessly interact with our physical world. I will conclude with a vision of how advances in machine learning and robotics may come together to help build such an Embodied Intelligence.

**Bio:** Pratik Chaudhari is a PhD candidate in the Computer Science department at the University of California, Los Angeles where he works with Stefano Soatto. His research interests include deep learning, robotics and computer vision. He has worked on perception and control algorithms for safe autonomous urban navigation as a part of nuTonomy Inc. Pratik holds Master’s and Engineer’s degrees from the Massachusetts Institute of Technology and a Bachelor’s degree from the Indian Institute of Technology Bombay in Aeronautics and Astronautics.

Fri 03/02/18, 12:00pm, Hackerman Hall B17 **Connecting Vision and Language End-to-End** *Kate Saenko, Boston University*

**Abstract:** Despite much progress in neural models for joint vision and language understanding, current models are largely opaque and non-compositional. Many language tasks are inherently compositional, and can be solved by decomposing them into modular sub-problems. I will describe End-to-End Module Networks (N2NMNs), which learn to answer questions about images by learning to decompose the question into subtasks, implemented as neural network modules. Experimental results show that N2NMNs achieve better accuracy than state-of-the-art attentional approaches, while discovering interpretable network architectures specialized for each question. I will also talk about our recent work on dense video captioning, and describe an end-to-end network that localizes activities in a long video and generates captions to describe each detected activity.

**Bio:** Kate Saenko is an Associate Professor of Computer Science at Boston University, director of the Computer Vision and Learning Group and co-director of the AI Research initiative at BU. Her past academic positions include: Assistant Professor at the Computer Science Department at UMass Lowell, Postdoctoral Researcher at ICSI, Visiting Scholar at UC Berkeley EECS and a Visiting Postdoctoral Fellow in the School of Engineering and Applied Science at Harvard University. Her research interests are in the broad area of Artificial Intelligence with a focus on Adaptive Machine Learning, Learning for Vision and Language Understanding, and Deep Learning.

(an ML talk in the CLSP Speaker Series)

Thu 03/01/18, 12:00am, Whitehead 304 **PetuumMed: Algorithm and System for EHR-based Medical Decision-Making** *Eric Xing, CMU*

**Abstract:** With the rapid growth of electronic health records (EHRs) and the advancement of machine learning technologies, the need for AI-enabled clinical decision-making support is emerging. In this talk, I will present some recent work toward this need at Petuum Inc., where we are building an integrative system that distills insights from large-scale and heterogeneous patient data, learns and integrates medical knowledge from broader sources such as the literature and domain experts, and empowers medical professionals to make accurate and efficient decisions within the clinical flow. I will discuss several aspects of practical clinical decision support, such as real-time information extraction from clinical notes and images, diagnosis and treatment recommendation, automatic report generation and ICD code filling, as well as the algorithmic and computational challenges behind production-quality solutions to these problems.

**Bio:** Eric Xing is a Professor of Computer Science at Carnegie Mellon University, and Founder and CEO of the machine learning platform startup Petuum Inc. He completed his undergraduate study at Tsinghua University, and holds a PhD in Molecular Biology and Biochemistry from Rutgers University, and a PhD in Computer Science from the University of California, Berkeley. His main research interests are the development of machine learning and statistical methodology, and large-scale computational systems and architectures, for solving problems involving automated learning, reasoning, and decision-making in high-dimensional, multimodal, and dynamic possible worlds in artificial, biological, and social systems. Prof. Xing is a board member of the International Machine Learning Society; in 2014, he served as the Program Chair of the International Conference on Machine Learning (ICML), and in 2019, he will serve as the General Chair of ICML. He is the Associate Department Head of the Machine Learning Department, founding director of the Center for Machine Learning and Health at Carnegie Mellon University, and a Fellow of the Association for the Advancement of Artificial Intelligence.

(an ML talk in the AMS Speaker Series)

Sat 02/17/18, 12:15pm, 316 Clark Hall **From Biological Neural Networks to Artificial Neural Networks** *Srini Turaga, HHMI Janelia Research Campus*

**Abstract:** In this talk, I will describe how we developed deep learning based computational tools to solve two important problems in neuroscience: predicting the activity of a neural network from measurements of its structural connectivity, and inferring the connectivity of a network of neurons from measurements and perturbation of neural activity. 1. Are measurements of the structural connectivity of a biological neural network sufficient to predict its function? We constructed a simplified model of the first two stages of the fruit fly visual system, the lamina and medulla. The result is a deep hexagonal lattice convolutional neural network which discovered well-known orientation and direction selectivity properties in T4 neurons and their inputs. Our work is the first demonstration that knowledge of neural connectivity can enable in silico predictions of the functional properties of individual neurons in a circuit, leading to an understanding of circuit function from structure alone. 2. Can we infer neural connectivity from noisy measurement and perturbation of neural activity? Population neural activity measurement by calcium imaging can be combined with cellular-resolution optogenetic activity perturbations to enable the mapping of neural connectivity in vivo. This requires accurate inference of perturbed and unperturbed neural activity from calcium imaging measurements, which are noisy and indirect. We built on recent advances in variational autoencoders to develop a new fully Bayesian approach to jointly inferring spiking activity and neural connectivity from in vivo all-optical perturbation experiments. Our model produces excellent spike inferences at 20K times real-time, and predicts connectivity for mouse primary visual cortex that is consistent with known measurements.

**Bio:** Srini Turaga is a group leader at the HHMI Janelia Research Campus. His lab develops machine learning tools for data analysis and modeling in neuroscience.

(an ML talk in the CIS Speaker Series)

Fri 02/16/18, 11:00am, Clark Hall 316 **Restricted Isometry Property of Gaussian Random Projection for Low-Dimensional Subspaces** *Yuantao Gu, Tsinghua University*

**Abstract:** Dimensionality reduction is in demand to reduce the complexity of solving large-scale problems with data lying in latent low-dimensional structures in machine learning and computer vision. Motivated by this need, in this talk I will introduce the Restricted Isometry Property (RIP) of Gaussian random projections for low-dimensional subspaces in R^N, and prove that the projection Frobenius norm distance between any two subspaces spanned by the projected data in R^n for n

(an ML talk in the CIS Speaker Series)

Mon 02/05/18, 12:15pm, BSPH, Room W2008 **Prior Adaptive Semi-supervised Learning with Application to Electronic Health Records Phenotyping** *Yichi Zhang, Harvard School of Public Health*

**Abstract:** Electronic Health Records (EHR) provide large and rich data sources for biomedical research, and EHR data have been successfully used to gain novel insights into several diseases. However, the usage of EHR data remains quite limited, because extracting a precise phenotype for an individual patient requires labor-intensive medical chart review, and such a manual process is not scalable. To facilitate an automatic procedure for accurate phenotyping, we formulate the problem in a high-dimensional setting and propose a semi-supervised method that combines information from chart-reviewed records with data-driven prior knowledge derived from the entire dataset. The proposed estimator, the Prior Adaptive Semi-supervised (PASS) estimator, enjoys nice theoretical properties including efficiency and robustness, and applies to a broad class of problems beyond EHR applications. The finite sample performance is evaluated via simulation studies and a real dataset on rheumatoid arthritis phenotyping. Further improvements involving word embedding and selective sampling are discussed.

(an ML talk in the Biostatistics Speaker Series)

Wed 01/31/18, 03:00pm, Gilman 219 **Principled Non-Convex Optimization for Deep Learning and Phase Retrieval** *Tom Goldstein, University of Maryland College Park*

**Abstract:** This talk looks at two classes of non-convex problems. First, we discuss phase retrieval problems, and present a new formulation, called PhaseMax, that reduces this class of non-convex problems into a convex linear program. Then, we turn our attention to more complex non-convex problems that arise in deep learning. We’ll explore the non-convex structure of deep networks using a range of visualization methods. Finally, we discuss a class of principled algorithms for training “binarized” neural networks, and show that these algorithms have theoretical properties that enable them to overcome the non-convexities present in neural loss functions.

Mon 12/04/17, 12:15pm, BSPH, Room W2008 **Exponential Family Functional Data Analysis via a Low-Rank Model** *Gen Li, Columbia University*

**Abstract:** In many applications, non-Gaussian data such as binary or count data are observed over a continuous domain, and there exists a smooth underlying structure for describing such data. We develop a new functional data method to deal with this kind of data when the data are regularly spaced on the continuous domain. Our method, referred to as Exponential Family Principal Component Analysis (EFPCA), assumes the data are generated from an exponential family distribution, and the matrix of the canonical parameters has a low-rank structure. The proposed method flexibly accommodates not only the standard one-way functional data, but also two-way (or bivariate) functional data. In addition, we introduce a new cross-validation method for estimating the latent rank of a generalized data matrix. We demonstrate the efficacy of the proposed methods using a comprehensive simulation study. The proposed method is also applied to the UK mortality study, where data are binomially distributed and two-way functional across age groups and calendar years. The results offer novel insights into the underlying mortality pattern.

(an ML talk in the Biostatistics Speaker Series)

Thu 11/16/17, 10:30am, Hackerman Hall B17 **Feature Selection and Fusion for 3D Object Category Recognition** *Haider Ali, JHU*

**Abstract:** Deep learning methods have received a lot of attention in research on 3D object recognition. Due to the lack of training data, many researchers use pre-trained convolutional neural networks (CNNs) and either extract the output of one of the last layers as features or fine-tune the networks on their data. Due to the extraordinary features that can be obtained from a CNN, we intend to use them for 3D object recognition in robotics, using active learning and the Mondrian forest classifier. We achieve superior results with a method that fine-tunes a CNN before feature extraction for RGB data. By combining these with features extracted from depth data and reducing the features’ dimensionality, we improve the state-of-the-art accuracy on the University of Washington RGB-D Object dataset in the standard offline case, using a support vector machine (SVM). Instead of an SVM as the classifier, we then use active learning and the Mondrian forest, an online classifier that can be updated over time as more data becomes available. Additionally, in our earlier work we present a novel combination of depth and color features to recognize different object categories in isolation. We also investigate the effect of domain change by training on the RGB-D Object dataset and testing on the DLR-RGB-D dataset. In our experiments we show that a domain change can have a significant impact on the model’s accuracy, and we present results on improving accuracy by increasing the variability of the objects in the training domain.

**Bio:** Dr. Haider Ali is currently an Associate Research Scientist at the Center for Imaging Science (CIS), Johns Hopkins University. Before joining CIS he worked as a Senior Researcher at the Institute of Robotics and Mechatronics (RM), German Aerospace Center (DLR). His research is primarily focused on developing efficient 3D object and activity recognition methods for real-time robotic applications. He received his Bachelor of Science in Computer Science in 1998 from Bahauddin Zakariya University, one of Pakistan’s major universities. After that he worked for several multinational IT companies in Pakistan as a Software Engineer and Project Consultant until 2004. He then pursued a master’s degree in software technology at Leuphana University of Lueneburg, graduating in 2006. He received his Ph.D. from the Technical University of Vienna in 2010.

(an ML talk in the CS Speaker Series)

Tue 11/14/17, 12:00pm, Hackerman B17 **End-to-End Deep Learning for Broad Coverage Semantics: SRL, Coreference and Beyond** *Luke Zettlemoyer, University of Washington*

**Abstract:** Deep learning with large supervised training sets has had significant impact on many research challenges, from speech recognition to machine translation. However, applying these ideas to problems in computational semantics has been difficult, at least in part due to modest dataset sizes and relatively complex structured prediction tasks. In this talk, I will present two recent results on end-to-end deep learning for classic challenge problems in computational semantics: semantic role labeling and coreference resolution. In both cases, we will introduce relatively simple deep neural network approaches that use no preprocessing (e.g. no POS tagger or syntactic parser) and achieve significant performance gains, including over 20% relative error reductions when compared to non-neural methods. I will also briefly discuss our efforts at crowdsourcing large new datasets that should, in the very near future, provide orders of magnitude more data for training such models. Our hope is that these two advances, when combined, will enable very high quality semantic analysis in any domain from easily gathered supervision. This is joint work with Luheng He, Kenton Lee, and Mike Lewis.

**Bio:** Luke Zettlemoyer is an Associate Professor in the Paul G. Allen School of Computer Science & Engineering at the University of Washington, and also leads the AllenNLP project at the Allen Institute for Artificial Intelligence. His research focuses on empirical computational semantics, and involves designing machine learning algorithms and building large datasets. Honors include multiple paper awards, a PECASE award, and an Allen Distinguished Investigator Award. Luke received his PhD from MIT and was a postdoc at the University of Edinburgh.

(an ML talk in the CLSP Speaker Series)

Thu 10/19/17, 12:00pm, Hackerman B17 **Online Vehicle Routing: The Edge of Optimization at the Largest Scale** *Dimitris Bertsimas, MIT*

**Abstract:** With the emergence of new companies that offer transportation on demand at large scale and the availability of very large data sets, new challenges arise to develop vehicle routing optimization algorithms that are capable of solving very large real-time problems involving tens of thousands of customers per hour. In this seminar, we develop generalizable algorithms that allow us to apply mixed integer optimization to the largest practical problem scale. Our optimization framework, coupled with a novel backbone algorithm, allows us to dispatch in real time thousands of taxis serving more than 25,000 customers per hour using the real New York City demand data and routing network. We provide evidence from historical simulations to show that our algorithms improve upon the performance of existing heuristics in real-world settings.

**Bio:** Dimitris Bertsimas is currently the Boeing Professor of Operations Research and the co-director of the Operations Research Center at the Massachusetts Institute of Technology. He received a BS in Electrical Engineering and Computer Science at the National Technical University of Athens, Greece, an MS in Operations Research at MIT, and a Ph.D. in Applied Mathematics and Operations Research at MIT. He has been on the MIT faculty since 1988. Since the 1990s he has started several successful companies in the areas of financial services, asset management, health care, publishing, analytics and aviation. His current research interests include personalized medicine, analytics, machine learning and robust and adaptive optimization. He has co-authored more than 200 scientific papers and four textbooks. He is the current Editor-in-Chief of the INFORMS Journal on Optimization and a former area editor for Financial Engineering in Operations Research and for Optimization in Management Science. He has supervised 63 doctoral students and is currently supervising 25 others.

Tue 10/03/17, 12:00pm, Hackerman B17 **Sequence to Sequence Learning: Fast Training and Inference with Gated Convolutions** *Michael Auli, Facebook AI Research*

**Abstract:** Neural architectures for machine translation and language modeling are an active research field. The first part of this talk introduces several architectural changes to the original work of Bahdanau et al. 2014. We replace non-linearities with our novel gated linear units, replace recurrent units with convolutions, and introduce multi-hop attention. These changes improve generalization performance, training efficiency and decoding speed. The second part of the talk analyzes the properties of the distribution predicted by the model and how this influences search.

**Bio:** Michael Auli is a research scientist at Facebook AI Research in Menlo Park. Michael earned his PhD for his work on CCG parsing at the University of Edinburgh, where he was advised by Adam Lopez and Philipp Koehn. He did his postdoc at Microsoft Research, where he worked on neural machine translation and neural dialogue models. Currently, Michael works on machine learning and its application to natural language processing; he is particularly interested in text generation tasks.

(an ML talk in the CLSP Speaker Series)

Fri 09/29/17, 12:00pm, Hackerman B17 **Deep Reinforcement Learning of Sequential Decision Making Tasks with Natural Language Interaction** *Satinder Singh, University of Michigan*

**Abstract:** The success of Deep Learning (DL) on visual perception has led to rapid progress on Reinforcement Learning (RL) tasks with visual inputs. More recently, Deep Learning has shown promise on certain kinds of supervised natural language problems, and this too is making its way into RL tasks with natural language inputs. In this talk, I will describe two projects in this direction from my group. The first (url 1 below) involves learning to query, reason, and answer questions on simple forms of ambiguous texts designed to focus on a specific problem that occurs in dialog systems. The second (url 2 below) involves zero-shot generalization to unseen instructions in a 3D maze navigation task, for which we develop a hierarchical DeepRL architecture. 1. http://web.eecs.umich.edu/~baveja/Papers/GuoICLR2017.pdf 2. http://web.eecs.umich.edu/~baveja/Papers/task-generalization.pdf

**Bio:** Dr. Satinder Singh is a Professor of Computer Science & Engineering at the University of Michigan where he also served as the Director of the Artificial Intelligence Laboratory from 2006 to 2017. He is also a co-founder and Chief Scientist of Cogitai, Inc. Dr. Singh’s research interests focus on the field of Reinforcement Learning, i.e., on building algorithms, theory, and architectures for software agents that can learn how to act in uncertain, complex, and dynamic environments. Specific interests include building models of dynamical systems from time-series data, learning good interventions in human-machine interaction, dealing with partial observability and hidden state in sequential decision-making, dealing with the challenge of exploration-exploitation and delayed feedback, explaining animal and human decision making using computational models, and optimal querying in semi-autonomous agents based on value of information. He is interested in applications from healthcare, robotics, and game-playing. He is a Fellow of the Association for the Advancement of Artificial Intelligence, was Program Co-Chair of AAAI 2016, has received an outstanding faculty award from his department, and has published over 150 papers in his field.

(an ML talk in the CLSP Speaker Series)

Fri 09/22/17, 12:00pm, Hackerman B17 **Structure-Sensitive Dependency Learning in Recurrent Neural Networks** *Tal Linzen, JHU*

**Abstract:** Neural networks have recently become ubiquitous in natural language processing systems, but we typically have little understanding of specific capabilities of these networks beyond their overall accuracy in an applied task. The present work investigates the ability of recurrent neural networks (RNNs), which are not equipped with explicit syntactic representations, to learn structure-sensitive dependencies from a natural corpus; we use English subject-verb number agreement as our test case. We examine the success of the RNNs (in particular LSTMs) in predicting whether an upcoming English verb should be plural or singular. We focus on specific sentence types that are indicative of the network’s syntactic abilities; our tests use both naturally occurring sentences and constructed sentences from the experimental psycholinguistics literature. We analyze the internal representations of the network to explore the sources of its ability (or inability) to approximate sentence structure. Finally, we compare the errors made by the RNNs to agreement attraction errors made by humans. RNNs were able to approximate certain aspects of syntactic structure very well, but only in common sentence types and only when trained specifically to predict the number of a verb (as opposed to a standard language modeling objective). In complex sentences their performance degraded substantially; they made many more errors than human participants. These results suggest that stronger inductive biases are likely to be necessary to eliminate errors altogether; we begin to investigate to what extent these biases can arise from multi-task learning. More broadly, our work illustrates how methods from linguistics and psycholinguistics can help us understand the abilities and limitations of “black-box” neural network models.

**Bio:** Tal Linzen is an Assistant Professor of Cognitive Science at Johns Hopkins University. Before moving to Johns Hopkins, he was a postdoctoral researcher at the École Normale Supérieure in Paris, where he worked with Emmanuel Dupoux and Benjamin Spector, and was affiliated with the Laboratoire de Sciences Cognitives et Psycholinguistique and Institut Jean Nicod. Dr. Linzen obtained his PhD from the Department of Linguistics at New York University in 2015, under the supervision of Alec Marantz. His interests are in developing and testing cognitive models of human language; particular problems he has worked on are probabilistic prediction in language comprehension, generalization in language learning and the linguistic capacities of artificial neural networks.

(an ML talk in the CLSP Speaker Series)

Thu 09/21/17, 01:30pm, Room 461 Bloomberg Building **Tutorial on Deep Learning with Apache MXNet Gluon** *Alex Smola, Amazon Machine Learning and CMU Machine Learning Department*

**Abstract:** This tutorial introduces Gluon, a flexible new interface that pairs MXNet’s speed with a user-friendly frontend. Symbolic frameworks like Theano and TensorFlow offer speed and memory efficiency but are harder to program. Imperative frameworks like Chainer and PyTorch are easy to debug, but they can seldom compete with symbolic code when it comes to speed. Gluon reconciles the two, removing a crucial pain point by using just-in-time compilation and an efficient runtime engine. In this crash course, we’ll cover deep learning basics, the fundamentals of Gluon, advanced models, and multiple-GPU deployments. We will walk you through MXNet’s NDArray data structure and automatic differentiation tools. We’ll show you how to define neural networks at the atomic level, and through Gluon’s predefined layers. We’ll demonstrate how to serialize models and build dynamic graphs. Finally, we will show you how to hybridize your networks, simultaneously enjoying the benefits of imperative and symbolic deep learning.

**Bio:** Dr. Smola studied physics at the University of Technology, Munich, at the Università degli Studi di Pavia and at AT&T Research in Holmdel. During this time, he was at the Maximilianeum München and the Collegio Ghislieri in Pavia. In 1996, he received his Master’s degree at the University of Technology, Munich and in 1998 the Doctoral Degree in computer science at the University of Technology Berlin. Until 1999, he was a researcher at the IDA Group of the GMD Institute for Software Engineering and Computer Architecture in Berlin (now part of the Fraunhofer Gesellschaft). After that, he worked as a Researcher and Group Leader at the Research School for Information Sciences and Engineering of the Australian National University. From 2004 onwards, he worked as a Senior Principal Researcher and Program Leader at the Statistical Machine Learning Program at NICTA. From 2008 to 2012, he worked at Yahoo Research. In spring of 2012, he moved to Google Research “to spend a wonderful year in Mountain View” and continued to work there until the end of 2014. Since 2013, Dr. Smola has been a professor at Carnegie Mellon University. He co-founded Marianas Labs in early 2015.

Thu 09/21/17, 10:45am, Hackerman B17 **Sequence Modeling: From Spectral Methods and Bayesian Nonparametrics to Deep Learning** *Alex Smola, Amazon*

**Abstract:** In this talk I will summarize a few recent developments in the design and analysis of sequence models. Starting with simple parametric models such as HMMs for sequences we look at nonparametric extensions in terms of their ability to model more fine-grained types of state and transition behavior. In particular we consider spectral embeddings, nonparametric Bayesian models such as the nested Chinese Restaurant Franchise and the Dirichlet-Hawkes Process. We conclude with a discussion of deep sequence models for user return time modeling, time-dependent collaborative filtering, and large-vocabulary user profiling.

**Bio:** Dr. Smola studied physics in Munich at the University of Technology, Munich, at the Universita degli Studi di Pavia and at AT&T Research in Holmdel. During this time, he was at the Maximilianeum München and the Collegio Ghislieri in Pavia. In 1996, he received his Master’s degree at the University of Technology, Munich and in 1998 the Doctoral Degree in computer science at the University of Technology Berlin. Until 1999, he was a researcher at the IDA Group of the GMD Institute for Software Engineering and Computer Architecture in Berlin (now part of the Fraunhofer Gesellschaft). After that, he worked as a Researcher and Group Leader at the Research School for Information Sciences and Engineering of the Australian National University. From 2004 onwards, he worked as a Senior Principal Researcher and Program Leader at the Statistical Machine Learning Program at NICTA. From 2008 to 2012, he worked at Yahoo Research. In spring of 2012, he moved to Google Research “to spend a wonderful year in Mountain View” and continued to work there until the end of 2014. Since 2013, Dr. Smola has been a professor at Carnegie Mellon University. He co-founded Marianas Labs in early 2015.

(an ML talk in the CS Speaker Series)

Mon 09/18/17, 12:15pm, School of Public Health, Room W2008**Link-Tracing Studies of Hidden Networks***Forrest W. Crawford, Yale University*

**Abstract:** Respondent-driven sampling (RDS) is a link-tracing survey method for sampling members of a hidden or hard-to-reach population such as drug users, sex workers, or homeless people via their social network. Starting with a set of “seed” subjects, participants use a small number of coupons tagged with a unique code to recruit their social contacts by giving them a coupon. Subjects report their network degree, but not the identities of their contacts. RDS is controversial and researchers disagree about whether it can be used to estimate population-level characteristics of hidden risk groups. In this presentation, I outline four results that permit principled network-based epidemiology from RDS. First, I show that a simple continuous-time model of RDS recruitment implies a well-defined probability distribution on the recruitment-induced subgraph of respondents; the resulting distribution is an exponential random graph model (ERGM). I develop a computationally efficient method for estimating the hidden graph. Second, I show that two sources of dependence in the RDS sample — network homophily and preferential recruitment — are confounded. However, it is still possible to make valid inferences via nonparametric graph-theoretic identification regions that permit hypothesis testing. Third, I derive conservative standard errors via graph-theoretic bounds for statistical functionals of the induced subgraph and traits of sampled subjects, including estimators of the population mean. Fourth, I describe a simple technique — based on capture-recapture and the network scale-up method — for estimating the size of a hidden population from an RDS sample. I apply these techniques to RDS studies of drug users in Eastern Europe, Russia, and Lebanon.

(an ML talk in the Biostatistics Speaker Series)

Wed 09/13/17, 03:00pm, Hodson Hall 203**No Equations, No Variables, No Parameters, No Space, No Time: Data and the Computational Modeling of Complex/Multiscale Systems***Yannis Kevrekidis, ChemBE, JHU*

**Abstract:** Obtaining predictive dynamical equations from data lies at the heart of science and engineering modeling, and is the linchpin of our technology. In mathematical modeling one typically progresses from observations of the world (and some serious thinking!) first to equations for a model, and then to the analysis of the model to make predictions. Good mathematical models give good predictions (and inaccurate ones do not) – but the computational tools for analyzing them are the same: algorithms that are typically based on closed form equations. While the skeleton of the process remains the same, today we witness the development of mathematical techniques that operate directly on observations (data), and appear to circumvent the serious thinking that goes into selecting variables and parameters and deriving accurate equations. The process then may appear to the user a little like making predictions by “looking in a crystal ball”. Yet the “serious thinking” is still there and uses the same (and some new) mathematics: it goes into building algorithms that “jump directly” from data to the analysis of the model (which is now not available in closed form) so as to make predictions. Our work here presents a couple of efforts that illustrate this “new” path from data to predictions. It really is the same old path, but it is travelled by new means.

Fri 09/08/17, 12:00pm, Hackerman B17**An Overview of Deep Learning Frameworks and an Introduction to PyTorch***Soumith Chintala, Facebook*

**Abstract:** In this talk, you will be introduced to the various types of deep learning frameworks – declarative and imperative frameworks such as TensorFlow and PyTorch, respectively. After a broad overview of frameworks, you will be introduced to the PyTorch framework in more detail. We will discuss your perspective as a researcher and a user, formalizing the needs of research workflows (covering data pre-processing and loading, model building, etc.). Then, we shall see how the different features of PyTorch map to helping you with these workflows.

**Bio:** Soumith Chintala is one of the main developers of PyTorch. He is a Researcher at Facebook AI Research, where he works on deep learning, reinforcement learning, generative image models, agents for video games and large-scale high-performance deep learning. Prior to joining Facebook in August 2014, he worked at MuseAmi, where he built deep learning models for music and vision targeted at mobile devices. He holds a Masters in CS from NYU, and spent time in Yann LeCun’s NYU lab building deep learning models for pedestrian detection, natural image OCR, and depth images, among others.

(an ML talk in the CLSP Speaker Series)

Tue 05/09/17, 01:30pm, Clark Hall**Understanding Deep Neural Networks with Rectified Linear Units***Amitabh Basu, JHU*

**Abstract:** We will begin by giving a precise characterization of the family of functions representable by deep neural networks (DNN) with rectified linear units (ReLU). In this characterization, we will also attempt to give a gradation within this family of functions based on size and depth of the network. As a consequence of our classification, we give polynomial time algorithms for training ReLU DNN with one hidden layer to *global optimality*, for a fixed size and number of inputs; in other words, the running time is polynomial in the number of data points to be trained on, assuming the size and number of inputs to the network as fixed constants. We will next investigate the depth vs. size trade-off for such neural networks. In particular, we will construct a smoothly parameterized family of R -> R “hard” functions that lead to an exponential blow-up in size, if the number of layers is decreased by a small amount. An example consequence of our gap theorem is that for every natural number N, there exists a smoothly parameterized family of R -> R functions representable by ReLU DNNs with N^2 hidden layers and total size N^3, such that any ReLU DNN with depth at most N hidden layers will require at least (1/2)N^N − 1 total nodes to represent any function from this class. Finally, we construct a family of R^n -> R functions for n >= 2 (also smoothly parameterized), whose number of affine pieces scales exponentially with the dimension ‘n’ at any fixed size and depth. To the best of our knowledge, such a construction with exponential dependence on ‘n’ has not been achieved by previous families of “hard” functions in the neural nets literature. The mathematical tools used to obtain the above results range from tropical geometry to the theory of zonotopes from polyhedral theory. This is joint work with Raman Arora, Poorya Mianjy and Anirbit Mukherjee.
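
The depth vs. size trade-off can be illustrated with a small sketch. It uses the classic tent-map composition, which is an assumption chosen for illustration and not the construction from the talk: each layer has two ReLU units, and stacking `depth` such layers yields a piecewise linear R -> R function with 2^depth affine pieces. The helper names `tent`, `deep_tent` and `count_linear_pieces` are hypothetical.

```python
def relu(x):
    return max(0.0, x)


def tent(x):
    # One hidden layer with two ReLU units:
    # equals 2x on [0, 0.5] and 2 - 2x on [0.5, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def deep_tent(x, depth):
    # Compose the tent layer `depth` times (a ReLU network with `depth` hidden layers).
    for _ in range(depth):
        x = tent(x)
    return x


def count_linear_pieces(depth):
    # Evaluate on a dyadic grid fine enough to contain every breakpoint,
    # then count how often the slope changes between neighbouring cells.
    n = 2 ** (depth + 2)
    ys = [deep_tent(i / n, depth) for i in range(n + 1)]
    slopes = [ys[i + 1] - ys[i] for i in range(n)]
    changes = sum(1 for a, b in zip(slopes, slopes[1:]) if a != b)
    return changes + 1


if __name__ == "__main__":
    for depth in range(1, 6):
        print(depth, count_linear_pieces(depth))
```

With only 2 units per layer, the number of affine pieces doubles with every added layer on [0, 1], which gives a feel for why shallow networks may need many more nodes to represent the same function.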

**Bio:** Amitabh Basu has been an assistant professor in the Dept. of Applied Mathematics and Statistics at Johns Hopkins University since 2013. He was a visiting assistant professor in the Dept. of Mathematics at University of California, Davis, from 2010 to 2013. He obtained his Ph.D. in 2010 from Carnegie Mellon University with a thesis on mixed-integer optimization. He received an M.S. in Computer Science from Stony Brook University in 2006, and a bachelor’s in Computer Science from Indian Institute of Technology, Delhi in 2004. Amitabh Basu received the NSF CAREER award in 2015. He was one of the three finalists for the A.W. Tucker Prize 2012 awarded by the Mathematical Optimization Society. Basu serves as an Associate Editor for the journal Mathematics of Operations Research. He also currently serves as Vice Chair for the Integer and Discrete Optimization cluster within the INFORMS Optimization Society. http://www.ams.jhu.edu/~abasu9/

(an ML talk in the CIS Speaker Series)

Tue 05/02/17, 01:30pm, Clark Hall 314**Neural Spike Train Analysis Using The Statistical Paradigm***Robert E. Kass, CMU*

**Abstract:** I will briefly describe neural spike train data, and what I mean by “the statistical paradigm.” I will then illustrate by going through several examples. Along the way I’ll describe what I consider one of the major data analytic challenges in neuroscience. Finally, I’ll make a few comments about scientific reproducibility.

**Bio:** Robert E. (Rob) Kass is the Maurice Falk Professor of Statistics and Computational Neuroscience at Carnegie Mellon University. He received his Ph.D. in Statistics from the University of Chicago in 1980. His early work formed the basis for his book Geometrical Foundations of Asymptotic Inference, co-authored with Paul Vos. His subsequent research has been in Bayesian inference and, beginning in 2000, in the application of statistics to neuroscience. Kass is known not only for his methodological contributions, but also for several major review articles, including one with Adrian Raftery on Bayes factors (JASA, 1995), one with Larry Wasserman on prior distributions (JASA, 1996), and a pair with Emery Brown on statistics in neuroscience (Nature Neuroscience, 2004, also with Partha Mitra; Journal of Neurophysiology, 2005, also with Valerie Ventura). His book Analysis of Neural Data, with Emery Brown and Uri Eden, was published in 2014. Kass has also written widely read articles on statistical education. Recently, he and several co-authors published “Ten Simple Rules for Effective Statistical Practice” (PLOS Computational Biology, 2016). http://www.stat.cmu.edu/~kass

(an ML talk in the CIS Speaker Series)

Fri 04/28/17, 01:30pm, Clark 314**How Distributed ADMM is Affected by Network Topology***Guilherme Franca, Center for Imaging Science, JHU*

**Abstract:** At the core of most problems in machine learning and statistics lies an optimization problem, and in our current age of ever-increasing datasets it is indispensable to have scalable and fully distributed optimization algorithms. The alternating direction method of multipliers (ADMM) is a good alternative, since it is extremely robust, scalable, and easily distributed. Nevertheless, its rate of convergence is still poorly understood. Here we consider a fully distributed implementation of over-relaxed ADMM when solving a consensus problem. This implementation can be seen as a message passing algorithm, where several agents solve small and local problems, and share messages through a communication network. This network captures properties of the distributed implementation and of the particular problem being solved. Our goal is to provide a precise answer on how the convergence rate of ADMM depends on the topology of such a network. To this end, we first establish a duality between distributed ADMM and lifted Markov chains, which leads us to propose an interesting conjecture. Next, we attack the problem directly by computing the spectrum of an equivalent linear dynamical system, whose state transition matrix is nonsymmetric. Our results yield an explicit formula for ADMM’s rate of convergence, and also optimal parameter selection, in terms of spectral properties of the network. Using these results we are able to prove our conjecture, which is reminiscent of the speedup on the mixing time achieved by lifting several Markov chains.
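
For readers unfamiliar with consensus ADMM, here is a minimal sketch under strong simplifying assumptions: it uses the textbook global-variable form (every agent exchanges messages with one shared averaging step, so the network topology studied in the talk is not modelled), and each agent holds a hypothetical scalar quadratic objective. The function name `consensus_admm` and its parameters `rho` (penalty) and `alpha` (over-relaxation) are illustrative choices.

```python
def consensus_admm(targets, rho=1.0, alpha=1.0, iterations=100):
    """Agent i holds f_i(x) = 0.5 * (x - targets[i])**2; all agents must agree on x."""
    n = len(targets)
    z = 0.0            # shared consensus variable
    u = [0.0] * n      # scaled dual variables, one per agent
    for _ in range(iterations):
        # Local step: each agent minimizes f_i(x) + (rho / 2) * (x - z + u_i)**2.
        x = [(a + rho * (z - ui)) / (1.0 + rho) for a, ui in zip(targets, u)]
        # Over-relaxation: blend the new local solutions with the previous consensus.
        x_hat = [alpha * xi + (1.0 - alpha) * z for xi in x]
        # Consensus step: average the messages sent by the agents.
        z = sum(xh + ui for xh, ui in zip(x_hat, u)) / n
        # Dual step: accumulate each agent's disagreement with the consensus.
        u = [ui + xh - z for ui, xh in zip(u, x_hat)]
    return z


if __name__ == "__main__":
    data = [1.0, 2.0, 6.0]
    print(consensus_admm(data))                              # approaches the mean of data
    print(consensus_admm(data, alpha=1.5, iterations=10))    # over-relaxed variant
```

In this toy setting the consensus value converges to the average of the agents' targets, and the consensus error shrinks by a factor of 1 - alpha / (1 + rho) per iteration, so the over-relaxation parameter directly changes the convergence rate.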

**Bio:** Guilherme Franca obtained his PhD in Theoretical Physics (2011) from the International Institute of Theoretical Physics, in Sao Paulo, Brazil. After that he was a postdoctoral researcher in the Theoretical Particle Physics Group at Cornell University. He worked on mathematical methods at the interface of Quantum Field Theory, Statistical Mechanics, and pure Mathematics. By the end of this postdoc, he had developed an interest in Machine Learning, and consequently joined the Center for Imaging Science at Johns Hopkins as a postdoctoral fellow in the second half of last year.

(an ML talk in the CIS Speaker Series)

Thu 04/27/17, 12:15pm, Genome Cafe (SPH E3609)**Enriched Training Sample Selection for Machine Learning in Cancer Screening Studies***Peng Huang, Oncology Biostatistics, JHMI*

**Abstract:** The percentage of cancers detected in screening studies is typically small (rare event rate). Extensive publications have shown that computer-aided diagnosis (CAD) approaches can give high diagnostic accuracy. To translate this technique into clinical practice, a training set is often selected from historical images and machine-learning algorithms are applied to the images to develop a prediction algorithm. A good training sample selection is critical to obtain a good prediction. Intuitively, a random sample from the historical data should be selected. However, due to the rare event rate, the number of cancers included in the random sample is typically small, making it difficult to reach high accuracy if the training sample size is not large enough. On the other hand, a large training sample size imposes computational expense in CAD. The question is how to select a good training set with fixed sample size N. I propose an enriched sampling method which shows improved prediction accuracy and image marker selection as compared to both a random training sample and a case-control sample with fixed sample size N.

Tue 04/25/17, 10:45am, Hackerman B17**Compositional Models for Information Extraction***Mark Dredze, JHU*

**Abstract:** Relation extraction systems are the backbone of many end-user applications, including question answering and web search. They are also increasingly used in clinical text analysis with EHR data to advance goals in population health. Advances in machine learning have led to new neural models for learning effective representations directly from data. Yet for many tasks, years of research have created hand-engineered features that yield state-of-the-art performance. This is the case in relation extraction, in which a system consumes natural language and produces a structured, machine-readable representation of relationships between entities, such as extracting medication references from clinical notes.

**Bio:** Mark Dredze is an Assistant Research Professor in Computer Science at Johns Hopkins University and a research scientist at the Human Language Technology Center of Excellence. He is also affiliated with the Center for Language and Speech Processing, the Center for Population Health Information Technology, and holds a secondary appointment in the Department of Health Sciences Informatics in the School of Medicine. He obtained his PhD from the University of Pennsylvania in 2009. Prof. Dredze has wide-ranging research interests developing machine learning models for natural language processing (NLP) applications. Within machine learning, he develops new methods for graphical models, deep neural networks, topic models and online learning, and has worked in a variety of learning settings, such as semi-supervised learning, transfer learning, domain adaptation and large-scale learning. Within NLP he focuses on information extraction but has considered a wide range of NLP tasks, including syntax, semantics, sentiment and spoken language processing. Beyond his work in core areas of computer science, Prof. Dredze has pioneered new applications of these technologies in public health informatics, including work with social media data, biomedical articles and clinical texts. He has published widely in health journals including the Journal of the American Medical Association (JAMA), the American Journal of Preventive Medicine (AJPM), Vaccine, and the Journal of the American Medical Informatics Association (JAMIA). His work is regularly covered by major media outlets, including NPR, the New York Times and CNN.

(an ML talk in the CS Speaker Series)

Mon 04/24/17, 12:15pm, SPH Room W2008**Classified Mixed Model Prediction***J. Sunil Rao, University of Miami*

**Abstract:** Many practical problems are related to prediction, where the main interest is at the subject level (e.g., personalized medicine) or the (small) sub-population level (e.g., a small community). In such cases, it is possible to make substantial gains in prediction accuracy by identifying a class that a new subject belongs to. This way, the new subject is potentially associated with a random effect corresponding to the same class in the training data, so that the method of mixed model prediction can be used to make the best prediction. We propose a new method, called classified mixed model prediction (CMMP), to achieve this goal. We develop CMMP for both prediction of mixed effects and prediction of future observations, and consider different scenarios where there may or may not be a “match” of the new subject among the training-data subjects. Theoretical and empirical studies are carried out to study the properties of CMMP and its comparison with existing methods. In particular, we show that, even if the actual match does not exist between the class of the new observations and those of the training data, CMMP still helps in improving prediction accuracy. Some examples will be presented including making predictions from breast cancer genomic data samples. Additionally, some delineation of the extension to the unknown grouping structure problem will be provided. This is joint work with Jiming Jiang of UC-Davis, Jie Fan of the University of Miami and Thuan Nguyen of Oregon Health and Science University.

(an ML talk in the Biostatistics Speaker Series)

Mon 04/24/17, 10:45am, Shaffer 101**Geometry, Optimization and Generalization in Multilayer Networks***Nathan (Nati) Srebro, TTIC and the University of Chicago*

**Abstract:** What is it that enables learning with multi-layer networks? What makes it possible to optimize the error, despite the problem being hard in the worst case? What causes the network to generalize well despite the model class having extremely high capacity? And how can we generalize even better? In this talk I will explore these questions through experimentation, analogy to matrix factorization, and study of alternate geometries and optimization approaches, and use the insight to develop improved training methods for deep and recurrent networks. Joint work with Behnam Neyshabur, Srinadh Bhojanapalli, Suriya Gunasekar, Russ Salakhutdinov, Ryota Tomioka and Tony Wu.

**Bio:** Nati Srebro obtained his PhD at the Massachusetts Institute of Technology (MIT) in 2004, held a post-doctoral fellowship with the Machine Learning Group at the University of Toronto, and was a Visiting Scientist at IBM Haifa Research Labs. Since January 2006, he has been on the faculty of the Toyota Technological Institute at Chicago (TTIC) and the University of Chicago, and has also served as the first Director of Graduate Studies at TTIC. From 2013 to 2014 he was associate professor at the Technion-Israel Institute of Technology. Prof. Srebro’s research encompasses methodological, statistical and computational aspects of Machine Learning, as well as related problems in Optimization. Some of Prof. Srebro’s significant contributions include work on learning “wider” Markov networks; pioneering work on matrix factorization and collaborative prediction, including introducing the use of the nuclear norm for machine learning and matrix reconstruction; and work on fast optimization techniques for machine learning, and on the relationship between learning and optimization.

(an ML talk in the CS Speaker Series)

Thu 04/20/17, 01:30pm, Whitehead 304**From Solving PDEs to Machine Learning PDEs: An Odyssey in Computational Mathematics***George Karniadakis, Brown University*

**Abstract:** In the last 30 years I have pursued the numerical solution of partial differential equations (PDEs) using spectral and spectral element methods for diverse applications, starting from deterministic PDEs in complex geometries, to stochastic PDEs for uncertainty quantification, and to fractional PDEs that describe non-local behavior in disordered media and viscoelastic materials. More recently, I have been working on solving PDEs in a fundamentally different way. I will present a new paradigm in solving linear and nonlinear PDEs from noisy measurements without the use of the classical numerical discretization. Instead, we infer the solution of PDEs from noisy data, which can represent measurements of variable fidelity. The key idea is to encode the structure of the PDE into prior distributions and train Bayesian nonparametric regression models on available noisy data. The resulting posterior distributions can be used to predict the PDE solution with quantified uncertainty, efficiently identify extrema via Bayesian optimization, and acquire new data via active learning. Moreover, I will present how we can use this new framework to learn PDEs from noisy measurements of the solution and the forcing terms.

(an ML talk in the AMS Speaker Series)

Mon 04/17/17, 12:00pm, School of Public Health, Room W2008**C-Learning: A New Classification Framework to Estimate Optimal Dynamic Treatment Regimes***Min Zhang, University of Michigan*

**Abstract:** A dynamic treatment regime is a sequence of decision rules, each corresponding to a decision point, that determine the next treatment based on each individual’s own available characteristics and treatment history up to that point. We show that identifying the optimal dynamic treatment regime can be recast as a sequential optimization problem and propose a direct sequential optimization method to estimate the optimal treatment regimes. In particular, at each decision point, the optimization is equivalent to sequentially minimizing a weighted expected misclassification error. Based on this classification perspective, we propose a powerful and flexible C-learning algorithm to learn the optimal dynamic treatment regimes backward sequentially from the last stage until the first stage. C-learning is a direct optimization method that directly targets optimizing decision rules by exploiting powerful optimization/classification techniques, and it allows incorporation of patients’ characteristics and treatment history to improve performance, hence enjoying the advantages of both the traditional outcome regression based methods (Q- and A-learning) and the more recent direct optimization methods.

(an ML talk in the Biostatistics Speaker Series)

Fri 04/14/17, 12:00pm, Hackerman B17**Bayesian Optimization and Other Potentially Bad Ideas for Hyperparameter Optimization***Kevin Jamieson, UC Berkeley*

**Abstract:** Performance of machine learning systems depends critically on tuning parameters that are difficult to set by standard optimization techniques. Such “hyperparameters”—including model architecture, regularization, and learning rates—are often tuned in an outer loop by black-box search methods evaluating performance on a holdout set. We formulate such hyperparameter tuning as a pure-exploration problem of deciding how many resources should be allocated to particular hyperparameter configurations. I will introduce our Hyperband algorithm for this framework and a theoretical analysis that demonstrates its ability to adapt to uncertain convergence rates and the dependency of hyperparameters on the validation loss. I will close with several experimental validations of Hyperband, including experiments on training deep networks where Hyperband outperforms state-of-the-art Bayesian optimization methods by an order of magnitude. I will also describe a highly scalable and asynchronous version of Hyperband I implemented and validated at Google.
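
The resource-allocation view of hyperparameter tuning can be sketched with successive halving, the kind of early-stopping subroutine that bandit-style tuners build on. This is a simplified illustration, not the Hyperband algorithm presented in the talk; the function `successive_halving`, the reduction factor `eta`, and the toy loss `toy_validation_loss` are all hypothetical.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Repeatedly keep the best 1/eta of the configurations and give them eta times more budget."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # Rank the surviving configurations by validation loss at the current budget.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        keep = max(1, len(ranked) // eta)
        survivors = ranked[:keep]
        budget *= eta
    return survivors[0]


def toy_validation_loss(learning_rate, budget):
    # Hypothetical stand-in for "train for `budget` epochs, then measure holdout loss":
    # the loss is lowest near learning_rate = 0.1 and improves as the budget grows.
    return (learning_rate - 0.1) ** 2 + 1.0 / budget


if __name__ == "__main__":
    candidates = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0]
    print(successive_halving(candidates, toy_validation_loss))
```

Most of the total budget ends up being spent on the few configurations that survive the early rounds, rather than being split evenly across all candidates.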

**Bio:** Kevin Jamieson is a postdoctoral researcher working with Professor Benjamin Recht in the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley. He is interested in the theory and practice of machine learning algorithms that sequentially collect data using an adaptive strategy. This includes active learning, multi-armed bandit problems, and stochastic optimization. Kevin received his Ph.D. from the University of Wisconsin, Madison under the advisement of Robert Nowak. Prior to his doctoral work, Kevin received his B.S. from the University of Washington, and an M.S. from Columbia University, both in electrical engineering.

(an ML talk in the CLSP Speaker Series)

Thu 04/13/17, 01:30pm, Whitehead 304**Reciprocal Graphical Models for Integrative Gene Regulatory Network Analysis***Peter Muller, University of Texas at Austin*

**Abstract:** Constructing gene regulatory networks is a fundamental task in systems biology. We introduce a Gaussian reciprocal graphical model for inference about gene regulatory relationships by integrating mRNA gene expression and DNA level information including copy number and methylation. Data integration allows for inference on the directionality of certain regulatory relationships, which would be otherwise indistinguishable due to Markov equivalence. Efficient inference is developed based on simultaneous equation models. Bayesian model selection techniques are adopted to estimate the graph structure. We illustrate our approach by simulations and two applications in ZODIAC pairwise gene interaction analysis and colon adenocarcinoma pathway analysis. Reference: Y. Ni, Y. Ji and P. Mueller, “Reciprocal Graphical Models for Integrative Gene Regulatory Network Analysis,” http://arxiv.org/abs/1607.06849

**Bio:** Dr. Peter Müller is a professor in the Department of Mathematics and the Department of Statistics at the University of Texas at Austin. He is a leader and pioneer in Bayesian statistics and MCMC. His research areas include Bayesian analysis and decision making; Markov chain Monte Carlo methods; clinical trial design; and nonparametric Bayes modeling (dependent gene expression, longitudinal data models, pharmacokinetic/pharmacodynamic models, case-control studies, hierarchical models). He is a Fellow of the IMS, ASA, and ISBA. He has also served as president or chair of several statistical organizations, such as ISBA and SBSS, and held the Robert R. Herring Distinguished Professorship in Clinical Research (2007–2011). His homepage: https://www.ma.utexas.edu/users/pmueller/

(an ML talk in the AMS Speaker Series)

Wed 04/12/17, 03:00pm, Whitehead Hall 304**A Sub-Linear Deterministic FFT for Sparse High Dimensional Signals***Andrew Christlieb, Michigan State University*

**Abstract:** In this talk we investigate the problem of efficient recovery of sparse signals (sparsity=k) in a high dimensional setting. In particular, we are going to investigate efficient recovery of the k largest Fourier modes of a signal of size N^d, where N is the bandwidth and d is the dimension. Our objective is the development of a high dimensional sub-linear FFT, d=100 or 1000, that can recover the signal in O(d k log k) time. The methodology is based on our one dimensional deterministic sparse FFT that is O(k log k). The original method is recursive and based on ratios of short FFTs of pairs of sub-sampled signals. The same ratio test allows us to identify when there is a collision due to aliasing the sub-sampled signals. The recursive nature allows us to separate and identify frequencies that have collided. Key in the high dimensional setting is the introduction of a partial unwrapping method and a tilting method that can ensure that we avoid collisions in the high dimensional setting on sub-sampled grids. We present the method, some analysis and results for a range of tests in both the noisy and noiseless cases.
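
The idea of recovering a frequency from the ratio of short FFTs of a pair of sub-sampled signals can be sketched for the simplest possible case: a one dimensional, noiseless signal with a single Fourier mode. This is only an illustration under those assumptions, not the speaker's algorithm; collisions, recursion, and the high dimensional unwrapping and tilting steps are not modelled. The names `dft` and `recover_single_frequency` are hypothetical.

```python
import cmath
import math


def dft(values):
    # Plain O(n^2) discrete Fourier transform; fine for the short sub-sampled signals here.
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * math.pi * k * j / n) for j, v in enumerate(values))
        for k in range(n)
    ]


def recover_single_frequency(signal, num_samples):
    n = len(signal)
    stride = n // num_samples
    # Short DFTs of a sub-sampled signal and of the same samples shifted by one position.
    base = dft([signal[j * stride] for j in range(num_samples)])
    shifted = dft([signal[(j * stride + 1) % n] for j in range(num_samples)])
    # The mode aliases into one bin of the short DFT; pick the bin with the most energy.
    k = max(range(num_samples), key=lambda b: abs(base[b]))
    # The one-sample shift multiplies that bin by exp(2*pi*i*f/n), so the ratio reveals f.
    phase = cmath.phase(shifted[k] / base[k])
    return round(phase * n / (2 * math.pi)) % n


if __name__ == "__main__":
    n, f = 64, 37
    tone = [cmath.exp(2j * math.pi * f * t / n) for t in range(n)]
    print(recover_single_frequency(tone, num_samples=8))
```

Only 16 of the 64 samples are read in this example, which is the sense in which such methods can run in time sub-linear in the signal length.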

Mon 04/10/17, 01:30pm, Clark 314**Stochastic Approximation for Representation Learning***Raman Arora, JHU*

**Abstract:** Unsupervised learning of useful features, or representations, is one of the most basic challenges of machine learning. Unsupervised representation learning techniques capitalize on unlabeled data which is often cheap and abundant and sometimes virtually unlimited. The goal of these ubiquitous techniques is to learn a representation that reveals intrinsic low-dimensional structure in data, disentangles underlying factors of variation, and is useful across multiple tasks and domains. This talk will focus on new theory and methods for large-scale representation learning. We will motivate a stochastic optimization view of representation learning in a big data setting rather than thinking of them as dimensionality reduction techniques for a given fixed dataset. We will put forth a mathematical definition of unsupervised learning, lay down different objectives employed for unsupervised representation learning, and describe stochastic approximation algorithms they admit. Time permitting, we will discuss applications to speech and language processing, social media analytics, and to healthcare.

**Bio:** Raman Arora (http://www.cs.jhu.edu/~raman) is an assistant professor in the Department of Computer Science at Johns Hopkins University (JHU), where he has been since 2014. He is affiliated with the Center for Language and Speech Processing (CLSP) and the Institute for Data Intensive Engineering and Science (IDIES). Prior to joining JHU, he was a Research Assistant Professor at Toyota Technological Institute at Chicago (TTIC), a visiting researcher at Microsoft Research (MSR) Redmond, and a postdoctoral scholar at the University of Washington in Seattle. He received his M.S. and Ph.D. degrees from the University of Wisconsin-Madison. His research interests lie at the interface of machine learning and stochastic optimization, with emphasis on representation learning techniques including subspace learning, multiview learning, deep learning, and spectral learning. Central to his research is the theory and application of stochastic approximation algorithms that can scale to big data.

(an ML talk in the CIS Speaker Series)

Fri 04/07/17, 12:00pm, Hackerman B17**Sparse Non-Negative Matrix Language Modeling***Ciprian Chelba, Google Research*

**Abstract:** We present Sparse Non-negative Matrix (SNM), a novel probability estimation technique for language modeling that can efficiently incorporate arbitrary features in a similar way to the more established family of maximum entropy (exponential) models. Due to its parsimonious parameterization the model can be estimated efficiently on small amounts of data. Experiments on various corpora show that the model matches established techniques in both perplexity and speech recognition accuracy. The computational advantages of SNM estimation over both maximum entropy and neural network estimation are probably its main strength, promising an approach that has large flexibility in combining arbitrary features and yet scales gracefully to large amounts of data.

**Bio:** Ciprian Chelba is a Research Scientist with Google. Between 2000 and 2006 he worked as a Researcher in the Speech Technology Group at Microsoft Research. He received his Diploma Engineer degree in 1993 from the Faculty of Electronics and Telecommunications at “Politechnica” University, Bucuresti, Romania, and his M.S. in 1996 and Ph.D. in 2000 from the Electrical and Computer Engineering Department at the Johns Hopkins University. His research interests are in statistical modeling of natural language and speech, as well as related areas such as machine learning and information theory as applied to natural language problems. Recent projects include language modeling for Google Voice Search and the Android soft keyboard.

(an ML talk in the CLSP Speaker Series)

Wed 04/05/17, 03:00pm, Whitehead 304**On Phase Transitions for Spiked Random Matrix and Tensor Models***Afonso Bandeira, Courant Institute – NYU*

**Abstract:** A central problem of random matrix theory is to understand the eigenvalues of spiked random matrix models, in which a prominent eigenvector (or low rank structure) is planted into a random matrix. These distributions form natural statistical models for principal component analysis (PCA) problems throughout the sciences, where the goal is often to recover or detect the planted low rank structure. In this talk we discuss fundamental limitations of statistical methods to perform these tasks and methods that outperform PCA at them. Emphasis will be given to low rank structures arising in Synchronization problems. Time permitting, analogous results for spiked tensor models will also be discussed. Joint work with: Amelia Perry, Alex Wein, and Ankur Moitra.

Fri 03/31/17, 12:00pm, Hackerman B17**Neural Approaches to Machine Reading Comprehension and Dialogue***Jianfeng Gao, Microsoft Research*

**Abstract:** In this talk, I start with a brief introduction to the history of symbolic approaches to natural language processing (NLP), and why we have recently moved to neural approaches. Then I describe in detail the deep learning technologies that have recently been developed for two areas of NLP tasks. First is a set of neural attention and inference models developed for machine reading comprehension and question answering. Second is the use of deep learning for various dialogue agents, including task-completion bots and social chat bots.

**Bio:** Jianfeng Gao is Partner Research Manager in Deep Learning Technology Center (DLTC) at Microsoft Research, Redmond. He works on deep learning for text and image processing and leads the development of AI systems for machine reading comprehension (MRC), question answering (QA), dialogue, and business applications. From 2006 to 2014, he was Principal Researcher at Natural Language Processing Group at Microsoft Research, Redmond, where he worked on Web search, query understanding and reformulation, ads prediction, and statistical machine translation. From 2005 to 2006, he was a research lead in Natural Interactive Services Division at Microsoft, where he worked on Project X, an effort to develop a natural user interface for Windows. From 2000 to 2005, he was Research Lead in Natural Language Computing Group at Microsoft Research Asia. He, together with his colleagues, developed the first Chinese speech recognition system released with Microsoft Office, the Chinese/Japanese Input Method Editors (IME) which were the leading products in the market, and the natural language platform for Windows Vista.

(an ML talk in the CLSP Speaker Series)

Wed 03/29/17, 10:00am, HLTCOE North Conference Room – Stieff Building**The Limits of Unsupervised Syntax and the Importance of Grounding in Language Acquisition***Yonatan Bisk, University of Southern California*

**Abstract:** The future of self-driving cars, personal robots, smart homes, and intelligent assistants hinges on our ability to communicate with computers. The failures and miscommunications of Siri-style systems are untenable and become more problematic as machines become more pervasive and are given more control over our lives. Despite the creation of massive proprietary datasets to train dialogue systems, these systems still fail at the most basic tasks. Further, their reliance on big data is problematic. First, successes in English cannot be replicated in most of the 6,000+ languages of the world. Second, while big data has been a boon for supervised training methods, many of the most interesting tasks will never have enough labeled data to actually achieve our goals. It is therefore important that we build systems which can learn from naturally occurring data and grounded situated interactions. In this talk, I will discuss work from my thesis on the unsupervised acquisition of syntax which harnesses unlabeled text in over a dozen languages. This exploration leads us to novel insights into the limits of semantics-free language learning. Having isolated these stumbling blocks, I’ll then present my recent work on language grounding where we attempt to learn the meaning of several linguistic constructions via interaction with the world.

**Bio:** Yonatan Bisk’s research focuses on Natural Language Processing from naturally occurring data (unsupervised and weakly supervised data). He is a postdoc researcher with Daniel Marcu at USC’s Information Sciences Institute. Previously, he received his Ph.D. from the University of Illinois at Urbana-Champaign under Julia Hockenmaier and his BS from the University of Texas at Austin.

(an ML talk in the HLTCOE Speaker Series)

Tue 03/28/17, 01:30pm, Clark Hall 314**A Well-Tempered Landscape for Non-Convex Robust Subspace Recovery***Gilad Lerman, University of Minnesota*

**Abstract:** We present a mathematical analysis of a gradient descent method for Robust Subspace Recovery. The optimization is cast as a minimization over the Grassmannian manifold, and gradient steps are taken along geodesics. We show that under a generic condition, the energy landscape is nice enough for the non-convex gradient method to exactly recover an underlying subspace. The condition is shown to hold with high probability for a certain model of data. This work is joint with Tyler Maunu and Teng Zhang.

**Bio:** Lerman received his Ph.D. in Mathematics at Yale University in 2000 under the direction of Ronald Coifman and Peter Jones. His postdoctoral experience included a Courant Instructorship (2000-2003) at New York University’s Courant Institute of Mathematical Sciences and training in bioinformatics as a research scientist in Bud Mishra’s Lab (2003-2004) at the same institute. He was a recipient of an NSF CAREER award in 2010 and the Feinberg Foundation Visiting Faculty Fellowship at the Weizmann Institute in 2013. Lerman has extensive experience working with industry both as a consultant and collaborator. His areas of research and expertise include high dimensional data, machine learning, algorithm design, and mathematical foundations of data analysis. As director of the Data Science Lab, his goal is to provide industry with access to academic research and tools for analysis of data.

(an ML talk in the CIS Speaker Series)

Tue 03/28/17, 12:00pm, Hackerman B17**Thinking on Your Feet: Reinforcement Learning for Incremental Language Tasks***Jordan Boyd-Graber, University of Colorado*

**Abstract:** In this talk, I’ll discuss two real-world language applications that require “thinking on your feet”: synchronous machine translation (or “machine simultaneous interpretation”) and question answering (when questions are revealed one piece at a time). In both cases, effective algorithms for these tasks must interrupt the input stream and decide when to provide output. Synchronous machine translation is when a sentence is being produced one word at a time in a foreign language and we want to produce a translation in English simultaneously (i.e., with as little delay as possible between a foreign language word and its English translation). This is particularly difficult in verb-final languages like German or Japanese, where an English translation can barely begin until the verb is seen. Effective translation thus requires predictions of unseen elements of the sentence (e.g., the main verb in German and Japanese, or relative clauses in Japanese, or post-positions in Japanese). We use reinforcement learning to decide when to trust our verb predictions. The system must learn to balance the risk of incorrect translations against the need for timely ones, and must use those predictions to translate the sentence. For question answering, we use a specially designed dataset that challenges humans: a trivia game called quiz bowl. These questions are written so that they can be interrupted by someone who knows more about the answer; that is, harder clues are at the start of the question and easier clues are at the end of the question. We create a novel neural network system to predict answers from incomplete questions and use reinforcement learning to decide when to guess. We are able to answer questions earlier in the question than most college trivia contestants.

**Bio:** Jordan Boyd-Graber is an assistant professor in the University of Colorado Boulder’s Computer Science Department (a Colorado native), formerly serving as an assistant professor at the University of Maryland. Before joining Maryland in 2010, he did his PhD with David Blei at Princeton. Jordan’s research focus is in applying machine learning and Bayesian probabilistic models to problems that help us better understand social interaction or the human cognitive process. He and his students have won “best of” awards at NIPS (2009, 2015), NAACL (2016), and CoNLL (2015), and Jordan won the British Computing Society’s 2015 Karen Spärck Jones Award and a 2017 NSF CAREER award. His research has been funded by DARPA, IARPA, NSF, NCSES, ARL, NIH, and Lockheed Martin and has been featured by CNN, Huffington Post, New York Magazine, and the Wall Street Journal.

(an ML talk in the CLSP Speaker Series)

Mon 03/27/17, 01:30pm, Hodson Hall Board Room**Mathematical Mysteries of Deep Neural Networks***Stephane Mallat, Ecole Normale Superieure*

**Abstract:** Classification and regression require approximating functions in high dimensional spaces. Avoiding the curse of dimensionality opens many questions in statistics, probability, harmonic analysis and geometry. Convolutional deep neural networks can obtain spectacular results for image analysis, speech understanding, natural languages and many other problems. We shall review their architecture and analyze their mathematical properties, with many open questions. We show that the architectures implement multiscale contractions, where wavelets have an important role, and they can learn groups of symmetries. This will be illustrated through applications to image and audio classification, but also to statistical physics and computations of molecular energies in quantum chemistry.

**Bio:** Stéphane Mallat received the Ph.D. degree in electrical engineering from the University of Pennsylvania, in 1988. He was then Professor at the Courant Institute of Mathematical Sciences, until 1994. In 1995, he became Professor in Applied Mathematics at Ecole Polytechnique, Paris and Department Chair in 2001. From 2001 to 2007 he was co-founder and CEO of a semiconductor start-up company. In 2012 he joined the Computer Science Department of Ecole Normale Supérieure, in Paris. Stéphane Mallat’s research interests include learning, signal processing, and harmonic analysis. He is a member of the French Academy of Sciences, a foreign member of the US National Academy of Engineering, an IEEE Fellow and a EUSIPCO Fellow. In 1997, he received the Outstanding Achievement Award from the SPIE Society and was a plenary lecturer at the International Congress of Mathematicians in 1998. He also received the 2004 European IST Grand prize, the 2004 INIST-CNRS prize for most cited French researcher in engineering and computer science, the 2007 EADS grand prize of the French Academy of Sciences, the 2013 Innovation medal of the CNRS, and the 2015 IEEE Signal Processing best sustaining paper award.

(an ML talk in the CIS Speaker Series)

Mon 03/27/17, 12:15pm, Room W2008**Nonparametric Spatial-Temporal Modelling of the Association Between Ambient Air Pollution and Adverse Pregnancy Outcomes***Montserrat Fuentes, Virginia Commonwealth University*

**Abstract:** Exposure to high levels of air pollution during the pregnancy is associated with increased probability of birth defects, a major cause of infant morbidity and mortality. New statistical methodology is required to specifically determine when a particular pollutant impacts the pregnancy outcome, to determine the role of different pollutants, and to characterize the spatial variability in these results. We introduce a new methodology for high dimensional environmental-health data. More specifically, we present a Bayesian spatial-temporal hierarchical multivariate probit regression model that identifies weeks during the first trimester of pregnancy which are impactful in terms of cardiac congenital anomaly development. The model is able to consider multiple pollutants and a multivariate cardiac anomaly grouping outcome jointly while allowing the critical windows to vary in a continuous manner across time and space. We utilize a dataset of numerical chemical model output which contains information regarding multiple species of particulate matter. Our introduction of an innovative spatial-temporal nonparametric prior distribution for the pollution risk effects allows for greater flexibility to identify critical weeks during pregnancy which are missed when more standard models are applied. We apply these methods to geocoded pregnancy outcomes in Texas.

(an ML talk in the Biostatistics Speaker Series)

Mon 03/27/17, 09:30am, HLTCOE North Conference Rm – Stieff Building**Probabilistic Models for Large, Noisy and Dynamic Data***Jay Pujara, University of California at Santa Cruz*

**Abstract:** We inhabit a vast, uncertain, and dynamic universe. To succeed in such an environment, artificial intelligence approaches must handle massive amounts of noisy, changing evidence. My research addresses the problems of building scalable, probabilistic models amenable to online updates. To illustrate the potential of such models, I present my work on knowledge graph identification, which jointly resolves the entities, attributes, and relationships in a knowledge graph by combining statistical NLP signals and semantic constraints. Using probabilistic soft logic, a statistical relational learning framework I helped develop, I demonstrate how knowledge graph identification can scale to millions of uncertain candidate facts and tens of millions of semantic dependencies in real-world data while achieving state-of-the-art performance. My work further extends this scalability by adopting a distributed computing approach, reducing the inference time of knowledge graph identification from two hours to ten minutes. Updating large, collective models like those used for knowledge graphs with new information poses a significant challenge. I develop a regret bound for probabilistic models and use this bound to motivate practical algorithms that support low-regret updates while improving inference time by over 65%. Finally, I highlight several active projects in sustainability, bioinformatics, and mobile analytics that provide a promising foundation for future research.

**Bio:** Jay Pujara is a postdoctoral researcher at the University of California, Santa Cruz whose principal areas of research are machine learning, artificial intelligence, and data science. He completed his PhD at the University of Maryland, College Park and received his MS and BS at Carnegie Mellon University. Prior to his PhD, Jay spent six years at Yahoo! working on mail spam detection, user trust, and contextual mail experiences, and he has also worked at Google, LinkedIn and Oracle. Jay is the author of over twenty peer-reviewed publications and has received three best paper awards for his work. He is a recognized authority on knowledge graphs, and has organized the Automatic Knowledge Base Construction (AKBC) workshop, recently presented a tutorial on knowledge graph construction, and has had his work featured in AI Magazine. For more information, visit https://www.jaypujara.org

(an ML talk in the HLTCOE Speaker Series)

Thu 03/16/17, 01:30pm, Whitehead 304**Nuke the Clouds: Using Nuclear Norm Optimization to Remove Clouds from Satellite Images***Peder Olsen, IBM*

**Abstract:** We discuss how to use the nuclear norm and matrix factorization techniques to remove clouds from satellite images. The talk will focus on discussing the key properties and variational inequalities that are commonly used in minimizing convex functions with a nuclear norm term. We will also contrast the convex formulations with the corresponding rank constrained problems that are highly non-convex, but which are sometimes simpler to solve regardless. Finally, we will show many examples and demos of how this works in practice.

(an ML talk in the AMS Speaker Series)

Wed 03/08/17, 12:00pm, Malone 228**Time-Space Hardness of Learning Sparse Parities***Avishay Tal, Princeton*

**Abstract:** How can one learn a parity function, i.e., a function of the form $f(x) = a_1 x_1 + a_2 x_2 + \dots + a_n x_n \pmod 2$ where $a_1, \dots, a_n \in \{0,1\}$, from random labeled examples? One approach is to gather O(n) random labeled examples and perform Gaussian elimination. This requires a memory of size O(n^2) and poly(n) time. Another approach is to go over all possible 2^n parity functions and to verify them by checking O(n) random examples for each possibility. This requires a memory of size O(n), but O(2^n * n) time. In a recent work, Raz [FOCS, 2016] showed that if an algorithm has memory of size much smaller than n^2, then it has to spend exponential time in order to learn a parity function. In other words, fast learning requires a good memory. In this work, we show that even if the parity function is known to be extremely sparse, where only log(n) of the a_i’s are nonzero, then the learning task is still time-space hard. That is, we show that any algorithm with linear size memory and polynomial time fails to learn log(n)-sparse parities. Consequently, the classical tasks of learning linear-size DNF formulae, linear-size decision trees, and logarithmic-size juntas are all time-space hard. Based on joint work with Gillat Kol and Ran Raz.
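The first approach in the abstract (collect O(n) labeled examples, then run Gaussian elimination) can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the talk: the function names `parity_label`, `solve_gf2`, and `learn_parity` are hypothetical, and the example count is an arbitrary choice. Storing the examples as an augmented matrix is what gives the O(n^2) memory footprint mentioned above.

```python
import random


def parity_label(a, x):
    """Return f(x) = a_1 x_1 + ... + a_n x_n (mod 2) for bit lists a and x."""
    return sum(ai & xi for ai, xi in zip(a, x)) % 2


def solve_gf2(examples, labels, n):
    """Recover a from labeled examples by Gauss-Jordan elimination over GF(2).

    Returns the coefficient list a, or None if the examples do not have
    full rank n (in which case a is not uniquely determined).
    """
    # Augmented matrix: each row is [x_1, ..., x_n | label].
    rows = [list(x) + [y] for x, y in zip(examples, labels)]
    pivot_row = 0
    for col in range(n):
        # Find a row at or below pivot_row with a 1 in this column.
        pivot = next(
            (r for r in range(pivot_row, len(rows)) if rows[r][col]), None
        )
        if pivot is None:
            return None  # rank-deficient: more examples are needed
        rows[pivot_row], rows[pivot] = rows[pivot], rows[pivot_row]
        # Clear this column in every other row (addition mod 2 is XOR).
        for r in range(len(rows)):
            if r != pivot_row and rows[r][col]:
                rows[r] = [u ^ v for u, v in zip(rows[r], rows[pivot_row])]
        pivot_row += 1
    # The first n rows are now the identity, so the last column holds a.
    return [rows[i][n] for i in range(n)]


def learn_parity(secret, num_examples, rng):
    """Draw uniformly random labeled examples and solve for the parity."""
    n = len(secret)
    examples = [[rng.randint(0, 1) for _ in range(n)] for _ in range(num_examples)]
    labels = [parity_label(secret, x) for x in examples]
    return solve_gf2(examples, labels, n)


if __name__ == "__main__":
    secret = [1, 0, 1, 1, 0, 0, 1, 0]
    print(learn_parity(secret, 4 * len(secret), random.Random(0)))
```

The second approach in the abstract trades this memory for time: it would keep only one candidate vector at a time and test it against fresh examples, at the cost of enumerating all 2^n candidates.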

**Bio:** Avishay Tal is a postdoctoral researcher at the Institute for Advanced Study (IAS) at Princeton. He joined IAS in 2015 after receiving a Ph.D. in computer science from Weizmann Institute, advised by Prof. Ran Raz. In 2012, he received an M.Sc. in computer science from the Technion, advised by Prof. Amir Shpilka. His research focuses on computational complexity, with emphasis on analysis of Boolean functions and circuit complexity.

(an ML talk in the CS Speaker Series)

Fri 03/03/17, 05:00pm, Clark Hall 314**Theoretical Guarantees for Convolutional Sparse Coding, and a Look into Convolutional Neural Networks***Jeremias Sulam, Israel Institute of Technology*

**Abstract:** Within the wide field of sparse approximation, convolutional sparse coding (CSC) has gained increasing attention in recent years, assuming a structured dictionary built as a union of banded Circulant matrices. While several works have been devoted to the practical aspects of this model, a systematic theoretical understanding of CSC seems to have been left aside. In this talk I will present a novel analysis of the CSC problem, based on the observation that while being global, this model can be characterized and analyzed locally. By imposing only local sparsity conditions, we show that uniqueness of solutions, stability to noise contamination and success of pursuit algorithms (both greedy and convex-relaxations) are globally guaranteed. These new results are much stronger and more informative than those obtained by deploying the classical sparse theory. Finally, I will briefly present a multi-layer extension of this model and show that it is closely related to Convolutional Neural Networks (CNN). This connection brings a fresh view to CNN, as one can attribute to this architecture theoretical claims under simple local sparse assumptions. This, in turn, will shed light on ways of improving the design and implementation of algorithms for CNN.

(an ML talk in the CIS Speaker Series)

Fri 03/03/17, 09:00am, Clark Hall 314**Up-Scaling Dictionary Learning***Jeremias Sulam, Israel Institute of Technology*

**Abstract:** Sparse approximation and dictionary learning have been applied with great success to various image processing tasks, often leading to state-of-the-art results. Yet, these methods have traditionally been restricted to small dimensions due to the computational constraints that these problems entail. In this talk, I will first give a brief introduction to the topic of dictionary learning, describing the basics of typical algorithms and reviewing some of their results and limitations. In order to go beyond the small patches in sparsity-based signal and image processing, I will then present a recent work demonstrating how to efficiently handle larger dimensions. This work employs a new cropped wavelets dictionary, which enables a multi-scale decomposition with virtually no border effects. We then employ this dictionary within a double sparsity model while leveraging the scaling properties of online learning. The resulting large trainable atoms, coined trainlets, not only achieve state-of-the-art performance in dictionary learning, but also open the door to new challenges and problems that remained unattainable until now.

(an ML talk in the CIS Speaker Series)

Thu 03/02/17, 01:30pm, Whitehead 304**Hierarchical Bayesian Modeling of Cosmic Populations: Why and How***Tom Loredo, Cornell University*

Fri 02/24/17, 01:30pm, 314 Clark Hall**Object Categorization in Real World***Xilin Chen, Chinese Academy of Sciences*

**Abstract:** Object categorization and scene understanding are a challenge in computer vision. In the past two decades, computer vision has made some progress in major tasks, such as face recognition. However, it’s still a big challenge to understand objects and their relationship in the real world. In this talk, I will discuss some issues on object categorization. As an object exists in an ecosystem related to its property from many aspects, we model categorization as a similarity measurement in high-dimensional semantic space. This representation provides a natural hierarchical way to describe objects, and it’s easy to support further tasks, such as zero shot object understanding, image captioning and visual QA. Extensive experiments are conducted on both face and general image sets. The results show promising performance.

**Bio:** Dr. Xilin Chen is a professor with the Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing. He is an IEEE / IAPR / CCF Fellow. Dr. Xilin Chen’s research interests include Computer Vision, Pattern Recognition, Image Processing, and Multimodal Interface. He has co-authored more than 200 papers. He is an associate editor of IEEE Transactions on Multimedia, a leading editor of Journal of Computer Science of Technology, and an associate editor in chief of the Chinese Journal of Computer. He was an associate editor of IEEE Transactions on Image Processing from 2009 to 2014. He serves as general chair of IEEE FG 2013 / FG 2018, program chair of ACM ICMI 2010, tutorial chair of IEEE FG 2011, publicity chair of ACM ICMI 2015, workshop chair of ACM ICMI 2009, demo chair of ACM ICMI 2006, and local chair of IEEE ICIP 2017, ACM MM 2009 and ICME 2007. He also serves as a program committee member for more than 60 International Conferences in related areas, including ICCV, CVPR, ICIP, ICPR, etc. He is a recipient of one of China’s State Natural Science Awards (2015) and four of China’s State Scientific and Technological Progress Awards (2012, 2005, 2003, and 2000).

(an ML talk in the CIS Speaker Series)

Thu 02/23/17, 04:00pm, Mason Hall Auditorium**Geometric Methods for the Approximation of High-Dimensional Dynamical Systems***Mauro Maggioni, JHU*

**Abstract:** We discuss a geometry-based statistical learning framework for performing model reduction and modeling of stochastic high-dimensional dynamical systems. We consider two complementary settings. In the first one, we are given long trajectories of a system, e.g. from molecular dynamics, and we discuss new techniques for estimating, in a robust fashion, an effective number of degrees of freedom of the system, which may vary in the state space of the system, and a local scale where the dynamics is well-approximated by a reduced dynamics with a small number of degrees of freedom. We then use these ideas to produce an approximation to the generator of the system and obtain, via eigenfunctions of an empirical Fokker-Planck equation, reaction coordinates for the system that capture the large time behavior of the dynamics. We present various examples from molecular dynamics illustrating these ideas. In the second setting we only have access to a (large number of expensive) simulators that can return short simulations of a high-dimensional stochastic system, and introduce a novel statistical learning framework for learning automatically a family of local approximations to the system, that can be (automatically) pieced together to form a fast global reduced model for the system, called ATLAS. ATLAS is guaranteed to be accurate (in the sense of producing stochastic paths whose distribution is close to that of paths generated by the original system) not only at small time scales, but also at large time scales, under suitable assumptions on the dynamics. We discuss applications to homogenization of rough diffusions in low and high dimensions, as well as relatively simple systems with separations of time scales, stochastic or chaotic, that are well-approximated by stochastic differential equations.

Tue 02/21/17, 01:30pm, Krieger 143**Objective Functionals of Machine Learning***Dejan Slepcev, CMU*

**Abstract:** We will discuss variational problems (objective functionals) arising in machine learning, in particular those relevant to clustering, semi-supervised learning, and dimensionality reduction. We will explore the connection between discrete functionals which deal with available data samples and continuum functionals which model the task when full information is available. We will discuss the mathematical structures needed to bridge the discrete and continuum worlds and the conclusions that arise. We will also discuss functionals which deal with finding the best one-dimensional approximation to a data set, and investigate their properties and applications.

Mon 02/20/17, 03:00pm, Krieger 413**Data-Driven Mathematical Analysis and Scientific Computing for Oscillatory Data***Haizhao Yang, Duke University*

**Abstract:** Large amounts of data now stream from daily life; data analytics has been helping to discover hidden patterns, correlations and other insights. This talk introduces the mode decomposition problem in the analysis of oscillatory data. This problem aims at identifying and separating pre-assumed data patterns from their superposition. It has motivated new mathematical theory and scientific computing tools in applied harmonic analysis. These methods are already leading to interesting and useful results in, e.g., electronic health record analysis, microscopy image analysis in materials science, art and history.

Fri 02/17/17, 12:00pm, Krieger 413**Data-Driven Stochastic Model Reduction***Fei Lu, University of California Berkeley*

**Abstract:** The need to infer reduced computational models of complex systems from discrete partial observations arises in many scientific and engineering applications, for example in climate prediction, materials science, and biology. The challenges come mainly from memory effects due to unresolved scales, from nonlinear interactions between resolved and unresolved scales, and from the difficulty in drawing inferences from discrete partial data. We address these challenges by a discrete-time stochastic parametrization method, and demonstrate by examples that the resulting stochastic reduced models can capture the key statistical dynamical features of the full system and make accurate short-term predictions. The examples include the Lorenz 96 system (which is a simplified model of the atmosphere) and the Kuramoto-Sivashinsky equation that describes spatiotemporally chaotic dynamics.

Tue 02/14/17, 12:00pm, Hackerman B17**Algorithmic Bias in Artificial Intelligence: The Seen and Unseen Factors Influencing Machine Perception of Images and Language***Margaret Mitchell, Google Research*

**Abstract:** The success of machine learning has recently surged, with similar algorithmic approaches effectively solving a variety of human-defined tasks. Tasks testing how well machines can perceive images and communicate about them have exposed strong effects of different types of bias, such as selection bias and dataset bias. In this talk, I will unpack some of these biases, and how they affect machine perception today. I will introduce and detail the first computational model to leverage human Reporting Bias — what people mention — in order to learn ground-truth facts about the visual world.

**Bio:** I am a Senior Research Scientist in Google’s Research & Machine Intelligence group, working on advancing artificial intelligence towards positive goals, as well as ethics in AI and demographic diversity of researchers. My research is on vision-language and grounded language generation, focusing on how to help computers communicate based on what they can process. My work combines computer vision, natural language processing, social media, many statistical methods, and insights from cognitive science. Before Google, I was a founding member of Microsoft Research’s “Cognition” group, focused on advancing vision-language artificial intelligence. Before MSR, I was a postdoctoral researcher at The Johns Hopkins University Center of Excellence, where I mainly focused on semantic role labeling and sentiment analysis using graphical models, working under Benjamin Van Durme. Before that, I was a postgraduate (PhD) student in the natural language generation (NLG) group at the University of Aberdeen, where I focused on how to naturally refer to visible, everyday objects. I primarily worked with Kees van Deemter and Ehud Reiter. I spent a good chunk of 2008 getting a Master’s in Computational Linguistics at the University of Washington, studying under Emily Bender and Fei Xia. Simultaneously (2005 – 2012), I worked on and off at the Center for Spoken Language Understanding, part of OHSU, in Portland, Oregon. My title changed with time (research assistant/associate/visiting scholar), but throughout, I worked on technology that leverages syntactic and phonetic characteristics to aid those with neurological disorders under Brian Roark. I continue to balance my time between language generation, applications for clinical domains, and core AI research.

(an ML talk in the CLSP Speaker Series)

Thu 02/09/17, 04:00pm, Mudd Hall, Room 100**Mapping Behavior to Neural Anatomy Using Machine Vision and Thermogenetics***Kristin M. Branson, HHMI, Janelia Research Campus*

**Abstract:** Assigning behavioral functions to neural structures has long been a central goal in neuroscience, and is a necessary first step toward a circuit-level understanding of how the brain generates behavior. Here we map the neural substrates of locomotion and social behaviors for Drosophila melanogaster using automated machine-vision and -learning techniques. From videos of 400,000 flies, we quantified the behavioral effects of activating 2,200 genetically targeted populations of neurons. We combined a novel quantification of anatomy with our behavioral analysis to create brain-behavior correlation maps, which are shared as browsable webpages and interactive software. Based on these maps, we discovered regions of the brain causally related to sensory processing, locomotor control, courtship, aggression, and sleep. Our maps directly specify genetic tools to target these regions, which we used to identify a small population of neurons with a role in the control of walking.

Tue 02/07/17, 01:30pm, 314 Clark Hall**Low-dimensional Manifold Models for Image Registration and Bayesian Statistical Shape Analysis***Miaomiao Zhang, MIT CSAIL*

**Abstract:** Investigating clinical hypotheses of diseases and their potential therapeutic implications based on large medical image collections is an important research area in medical imaging. Medical images provide insight into anatomical changes caused by diseases and are hence critical to disease diagnosis and treatment planning. Characterization of the anatomical changes poses computational and statistical challenges due to the high-dimensional and nonlinear nature of the data, as well as a vast number of unknown model parameters. In this talk, I will present efficient, robust, and reliable methods to address these problems. My approach entails (i) developing a low-dimensional shape descriptor to represent anatomical changes in large-scale image data sets, and (ii) novel Bayesian machine learning methods for analyzing the intrinsic variability of high-dimensional manifold-valued data with automatic dimensionality reduction and parameter estimation. The potential practical applications of this work beyond medical imaging include machine learning, computer vision, and computer graphics.

**Bio:** Miaomiao Zhang is a postdoctoral associate in the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT. She completed her PhD in the Computer Science Department at University of Utah. Her research work focuses on developing novel models at the intersection of statistics, mathematics, and computer engineering in the field of medical and biological imaging. Miaomiao Zhang received the Young Scientist Award at the Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2014 and was a runner-up for the same award at MICCAI 2016.

(an ML talk in the CIS Speaker Series)

Mon 02/06/17, 12:15pm, Room W2008**Bayesian Model Calibration and Prediction Applied to (Stochastic) Epidemic Simulations***David Higdon, Biocomplexity Institute of Virginia Tech*

**Abstract:** Dynamic network simulation models can evolve a very large system of agents over a network that evolves with time. Such models are often used to simulate epidemics or transportation, typically producing random trajectories, even when model parameters and initial conditions are identical. This introduces a number of challenges in designing ensembles of model runs for sensitivity analysis and computer model calibration. This talk will go through a case study of a recent epidemic, seeking to forecast the epidemic’s behavior given initial administrative information. I’ll discuss two different approaches for combining observations with this model for estimation and prediction. The first uses methods from traditional computer model calibration; the second approach – still under development – uses a sequential Monte Carlo approach.

(an ML talk in the Biostatistics Speaker Series)

Tue 01/31/17, 01:30pm, 314 Clark Hall**Zero-Shot Learning for Object Recognition in the Wild***Fei Sha, University of Southern California*

**Abstract:** Can computer vision systems be deployed to the wild where they can recognize (unanticipated) things of which they have not been told in their design stage? Zero-shot learning (ZSL) has emerged as a promising paradigm to address this challenge. Given semantic descriptions of object classes, zero-shot learning aims to accurately recognize objects of the unseen classes, from which no examples are available at the training stage, by associating them to the seen classes, where labeled examples are provided. In this talk, I will describe several efforts in my lab in developing learning methods for ZSL. Our main idea is to align the semantic space that is derived from external information to the model space that concerns itself with recognizing visual features. We demonstrate superior accuracy of our approaches over the state-of-the-art on several benchmark datasets for zero-shot learning, including the full ImageNet Fall 2011 dataset with more than 20,000 unseen classes.

**Bio:** Dr. Fei Sha is an associate professor at U. of Southern California. His primary research interests are machine learning and its application to speech and language processing, computer vision, and robotics. He won outstanding student paper awards at NIPS 2006 and ICML 2004. He was selected as a Sloan Research Fellow in 2013, won an Army Research Office Young Investigator Award in 2012, and was a member of the DARPA 2010 Computer Science Study Panel. He has a Ph.D (2007) in Computer and Information Science from U. of Pennsylvania and B.Sc and M.Sc from Southeast University (Nanjing, China). http://www-bcf.usc.edu/~feisha/

(an ML talk in the CIS Speaker Series)

Thu 01/19/17, 11:00am, Malone 107**Edward: A library for probabilistic modeling, inference, and criticism***Dustin Tran, Columbia University*

**Abstract:** Probabilistic modeling is a powerful approach for analyzing empirical information. In this talk, I will provide an overview of Edward, a software library for probabilistic modeling. Formally, Edward is a probabilistic programming system built on computational graphs, supporting compositions of both models and inference for flexible experimentation. For example, Edward makes it easy to fit the same model using a variety of composable inferences, ranging from point estimation, to variational inference, to MCMC. Edward is also integrated into TensorFlow, providing significant speedups over existing probabilistic systems. As examples, I will show how Edward can be leveraged for expanding the frontier of variational inference and deep generative models. Joint work with Alp Kucukelbir, Adji Dieng, Maja Rudolph, Dawen Liang, Matt Hoffman, Kevin Murphy, Eugene Brevdo, Ryan Rifkin, and David Blei.

Thu 12/01/16, 02:00pm, Clark 314**Advances in Algebraic Subspace Clustering and Dual Principal Component Pursuit***Manolis Tsakiris, JHU*

**Abstract:** Dissertation defense committee: Dr. René Vidal (advisor-first reader/BME), Dr. Daniel P. Robinson (second reader/AMS), Dr. Trac D. Tran, Dr. Sanjeev Khudanpur, Dr. John Wright (Electrical Engineering, Columbia University) Recent years have witnessed an explosion of data availability in a variety of scientific fields, enabled by advances in sensor technology and distributed hardware systems. This has given rise to the challenge of efficiently processing such data for performing various tasks, such as counting the number of moving objects in a video. Towards that end, two operations on the data are of fundamental importance, i.e., dimensionality reduction and clustering, and the idea of learning one or more linear subspaces from the data has proved a fruitful notion in that context. Nevertheless, state-of-the-art subspace learning methods, such as Robust Principal Component Analysis (RPCA) or Sparse Subspace Clustering (SSC), operate under the hypothesis that the underlying subspaces have small dimensions, and start to fail when these dimensions become large. This thesis attempts to advance the state-of-the-art of subspace learning methods in the regime of subspace dimensions comparable to the ambient dimension. The first major contribution of this thesis is an algorithm called Filtrated Algebraic Subspace Clustering (FASC), which builds upon algebraic geometric ideas of an older subspace clustering algorithm, known as Generalized Principal Component Analysis (GPCA). Even though GPCA is naturally suited for subspaces of maximal relative dimension, i.e., for hyperplanes, it suffers from two crucial weaknesses: sensitivity to noise and computational complexity. This thesis demonstrates that FASC addresses successfully the robustness to noise. This is achieved through an equivalent formulation of GPCA, which uses the idea of filtrations of unions of subspaces. 
A rigorous algebraic geometric analysis establishes the theoretical equivalence of the two methods, while experiments on synthetic and real data reveal that FASC not only dramatically improves upon the performance of GPCA, but also improves upon existing methods on several occasions. The second major contribution of this thesis is a single subspace learning algorithm called Dual Principal Component Pursuit (DPCP), which solves the robust PCA problem in the presence of outliers. Contrary to sparse and low-rank state-of-the-art methods, the theoretical guarantees of DPCP do not place any constraints on the dimension of the subspace. In particular, DPCP computes the orthogonal complement of the subspace, thus it is particularly suited for subspaces of low codimension (e.g., hyperplanes). This is done by solving a non-convex cosparse problem on the sphere, whose global minimizers are shown to be vectors normal to the subspace. Algorithms for solving the non-convex problem are developed and tested on synthetic and real data, showing that DPCP is able to handle higher subspace dimensions and larger amounts of outliers than existing methods. Finally, DPCP is extended theoretically and algorithmically to the case of multiple subspaces.

Thu 12/01/16, 01:30pm, Whitehead 304**Spectral Clustering for Dynamic Stochastic Block Model***Sharmodeep Bhattacharyya, Oregon State University*

**Abstract:** One of the most common and crucial aspects of many network data sets is the dependence of network link structure on time. In this work, we extend the existing (static) nonparametric latent variable model in the context of time-varying networks, and thereby propose a class of dynamic network models. For some special cases of these models (namely the dynamic stochastic block model and dynamic degree-corrected block model), which assume that there is a common clustering structure for all networks, we consider the problem of identifying the common clustering structure. We propose two extensions of the (standard) spectral clustering method for the dynamic network models, and give theoretical guarantees that the spectral clustering methods produce consistent community detection in the case of both the dynamic stochastic block model and the dynamic degree-corrected block model. The methods are shown to work under sufficiently mild conditions on the number of time snapshots to detect both associative and dissociative community structure, even if all the individual networks are very sparse and most of the individual networks are below the community detectability threshold. We also reinforce the validity of the theoretical results via simulations. (Joint work with Shirshendu Chatterjee, CUNY)

(an ML talk in the AMS Speaker Series)

Wed 11/30/16, 03:00pm, Krieger 309**Measure transport approaches for Bayesian computation***Youssef Marzouk, MIT*

**Abstract:** We will discuss how transport maps, i.e., deterministic couplings between probability measures, can enable useful new approaches to Bayesian computation. A first use involves a combination of optimal transport and Metropolis correction; here, we use continuous transportation to transform typical MCMC proposals into adapted non-Gaussian proposals, both local and global. Second, we discuss a variational approach to Bayesian inference that constructs a deterministic transport map from a reference distribution to the posterior, without resorting to MCMC. Independent and unweighted samples can then be obtained by pushing forward reference samples through the map. Making either approach efficient in high dimensions, however, requires identifying and exploiting low-dimensional structure. We present new results relating the sparsity and decomposability of transports to the conditional independence structure of the target distribution. We also describe conditions, common in inverse problems, under which transport maps have a particular low-rank or near-identity structure. In general, these properties of transports can yield more efficient algorithms. As a particular example, we derive new deterministic “online” algorithms for Bayesian inference in nonlinear and non-Gaussian state-space models with static parameters. This is joint work with Daniele Bigoni, Matthew Parno, and Alessio Spantini.

Fri 11/18/16, 12:00pm, School of Public Health W3008**Doubly Robust Survival Trees and Forests***Jon Steingrimsson, JHU Biostatistics*

**Abstract:** Survival trees use recursive partitioning to separate patients into distinct risk groups. Survival forest procedures average multiple survival trees, creating more flexible prediction models. In the absence of censoring, the corresponding algorithms rely heavily on the choice of loss function used in the decision making process. Motivated by semiparametric efficiency theory, we replace the loss function used in the absence of censoring by doubly robust loss functions. We derive properties of the doubly robust loss functions and show how the doubly robust survival trees and forest algorithms can be implemented using a certain form of response imputation. Furthermore, we discuss practical issues related to implementation of the algorithms. The performance of the resulting survival trees and forests is evaluated through simulation studies and through analyzing data on death from myocardial infarction.

(an ML talk in the Biostatistics Speaker Series)

Fri 11/18/16, 12:00pm, Hackerman B17**Flexible Models for Microclustering with Application to Entity Resolution***Rebecca Steorts, Duke University*

**Abstract:** Record linkage merges together large, potentially noisy databases to remove duplicate entities. Community detection is the process of placing entities into similar partitions or “communities.” Both tasks are important in applications such as author disambiguation, genetics, official statistics, human rights conflict, and others. It is common to treat record linkage and community detection as clustering tasks. In fact, generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman–Yor process mixture models make this assumption. This assumption is inappropriate for some tasks. For example, when performing record linkage, the size of each cluster is often unrelated to the size of the data set. Consequently, each cluster contains a negligible fraction of the total number of data points. Such tasks require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the “microclustering property” and discussing a new model that exhibits this property. We illustrate this on real and simulated data. This is joint work with Giacomo Zanella, Brenda Betancourt, Jeff Miller, Hanna Wallach, and Abbas Zaidi. https://arxiv.org/pdf/1610.09780.pdf

**Bio:** Rebecca C. Steorts received her B.S. in Mathematics in 2005 from Davidson College, her MS in Mathematical Sciences in 2007 from Clemson University, and her PhD in 2012 from the Department of Statistics at the University of Florida under the supervision of Malay Ghosh. She was a Visiting Assistant Professor in 2012–2015, where she worked closely with Stephen E. Fienberg. She is currently an Assistant Professor in the Department of Statistical Science at Duke University. Rebecca was named to MIT Technology Review’s 35 Innovators Under 35 for 2015 as a humanitarian in the field of software. Her work was profiled in the September/October issue of MIT Technology Review and she was recognized at a special ceremony along with an invited talk at EmTech in November 2015. In addition, Rebecca is a recipient of the Metaknowledge Network Templeton Foundation Grant, a National Science Foundation (NSF) SES grant, the University of Florida (UF) Graduate Alumni Fellowship Award, the U.S. Census Bureau Dissertation Fellowship Award, and the UF Innovation through Institutional Integration Program (I-Cubed) and NSF for development of an introductory Bayesian course for undergraduates. Rebecca has been awarded Honorable Mention (second place) for the 2012 Leonard J. Savage Thesis Award in Applied Methodology. Her research interests are in large scale clustering and record linkage for computational social science applications.

(an ML talk in the CLSP Speaker Series)

Mon 11/14/16, 01:30pm, Charles Commons Conference Center**Diffusion-based Interactions in Noisy Single Cell Data***Smita Krishnaswamy, Yale School of Medicine*

**Abstract:** Recently, there have been significant advances in single-cell genomic and proteomic technologies that can measure the expression of thousands of mRNA transcripts and dozens of proteins. However, this data suffers from sparsity and noise. Furthermore, its high dimensionality makes interpreting the data difficult for biologists. Our aim is to facilitate interpretation by providing a set of tools and novel algorithms that allow biologists to extract meaningful and predictive information from the data, and which yield clear and concise visualizations. In my lab we utilize a diffusion framework to learn the manifold geometry of the data. This framework models the local affinity between high-dimensional data points using a kernel function and then utilizes graph diffusion to form long range connections and paths through the data. Our framework has been used to impute and correct noisy single-cell RNA-sequencing data, using a method called MAGIC that utilizes the Markov diffusion operator that is part of the framework that models cellular neighborhoods. Then we extend this method to a family of novel transformations and algorithms performed on the Markov diffusion operator in order to emphasize several types of patterns in the data. First, we perform high dimensional interaction analysis between genes, pathways and modules using mutual information estimated from the high dimensional density analysis estimated via the steady state eigenvector of the diffusion operator. Secondly, we derive a new embedding, which we call PHATE, to map progressions in the data via a transformation and re-embedding of the diffusion operator. In our re-embedding, paths of data progression are emphasized to reveal differentiation and cell-state trajectories, as well as relationships between genes. Then, we propose a data condensation process by which we continually change the diffusion process such that cluster structures at all scales are revealed.
We show our algorithms on several biological systems. However, due to the generic nature of our methods they can be used in any system to enable the efficient and widespread analysis of single-cell data by revealing several types of structures in single-cell data.
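
The diffusion step described in this abstract can be illustrated with a minimal sketch. It assumes a fixed-bandwidth Gaussian kernel, and the names `markov_diffusion_operator`, `diffuse`, `bandwidth`, and `steps` are hypothetical. It is not the MAGIC implementation, only an illustration of smoothing data with powers of a row-normalized (Markov) affinity matrix.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def markov_diffusion_operator(points, bandwidth):
    """Row-normalize Gaussian affinities so every row sums to 1."""
    operator = []
    for p in points:
        affinities = [
            math.exp(-squared_distance(p, q) / (2.0 * bandwidth ** 2))
            for q in points
        ]
        total = sum(affinities)  # at least 1, because the self-affinity is exp(0)
        operator.append([a / total for a in affinities])
    return operator


def diffuse(points, bandwidth, steps):
    """Apply the diffusion operator `steps` times to the data (P^steps X)."""
    operator = markov_diffusion_operator(points, bandwidth)
    dim = len(points[0])
    smoothed = [list(p) for p in points]
    for _ in range(steps):
        smoothed = [
            [sum(w * q[d] for w, q in zip(row, smoothed)) for d in range(dim)]
            for row in operator
        ]
    return smoothed


# Illustrative data (an assumption): noisy samples near the line y = x.
noisy = [[0.0, 0.1], [1.0, 0.9], [2.0, 2.2], [3.0, 2.8]]
smoothed = diffuse(noisy, bandwidth=1.0, steps=2)
print(smoothed)
```

Each diffusion step replaces a point by a weighted average of its neighbors, so more steps give stronger smoothing. The choice of `bandwidth` and `steps` here is arbitrary.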

Thu 11/03/16, 01:30pm, Whitehead 304**An Introduction to Distance Preserving Projections of Smooth Manifolds***Mark Iwen, Michigan State University*

**Abstract:** Manifold-based image models are assumed in many engineering applications involving imaging and image classification. In the setting of image classification, in particular, proposed designs for small and cheap cameras motivate compressive imaging applications involving manifolds. Interesting mathematics results when one considers that the problem one needs to solve in this setting ultimately involves questions concerning how well one can embed a low-dimensional smooth sub-manifold of high-dimensional Euclidean space into a much lower dimensional space without knowing any of its detailed structure. We will motivate this problem and discuss how one might accomplish this seemingly difficult task using random projections. Little if any prerequisites will be assumed beyond linear algebra and some probability.
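
As a rough illustration of the random-projection idea mentioned in this abstract, the following sketch projects points sampled from a circle (a one-dimensional manifold) in a 1000-dimensional space down to 200 dimensions with a scaled Gaussian matrix, then compares pairwise distances before and after. The function names, dimensions, and seeds are assumptions chosen for the example and are not taken from the talk.

```python
import math
import random


def random_projection(points, target_dim, seed=0):
    """Project points with a Gaussian matrix scaled by 1/sqrt(target_dim)."""
    rng = random.Random(seed)
    source_dim = len(points[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [
        [rng.gauss(0.0, 1.0) * scale for _ in range(source_dim)]
        for _ in range(target_dim)
    ]
    return [
        [sum(row[j] * p[j] for j in range(source_dim)) for row in matrix]
        for p in points
    ]


def distance(a, b):
    """Euclidean distance between two points given as lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sample_circle(n, ambient_dim, seed=1):
    """Sample n points from a unit circle embedded in R^ambient_dim."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        point = [0.0] * ambient_dim
        point[0] = math.cos(angle)
        point[1] = math.sin(angle)
        points.append(point)
    return points


points = sample_circle(10, ambient_dim=1000)
projected = random_projection(points, target_dim=200)

# Ratio of projected distance to original distance for every pair of points.
ratios = [
    distance(projected[i], projected[j]) / distance(points[i], points[j])
    for i in range(len(points))
    for j in range(i + 1, len(points))
]
print(min(ratios), max(ratios))
```

Ratios close to 1 indicate that the projection approximately preserves pairwise distances. The projection matrix is drawn without looking at the data, which matches the setting of embedding a manifold without knowing its detailed structure.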

(an ML talk in the AMS Speaker Series)

Wed 11/02/16, 12:00am, Krieger 309**Variational Problems on Graphs and their Continuum Limits***Dejan Slepcev, Carnegie Mellon University*

**Abstract:** We will discuss variational problems arising in machine learning and their limits as the number of data points goes to infinity. Consider point clouds obtained as random samples of an underlying “ground-truth” measure. A graph representing the point cloud is obtained by assigning weights to edges based on the distance between the points. Many machine learning tasks, such as clustering and classification, can be posed as minimizing functionals on such graphs. We consider functionals involving graph cuts and graph Laplacians and their limits as the number of data points goes to infinity. In particular, we establish under what conditions the minimizers of discrete problems have a well defined continuum limit, and characterize the limit. The talk is primarily based on joint work with Nicolas Garcia Trillos, as well as on works with Xavier Bresson, Moritz Gerlach, Matthias Hein, Thomas Laurent, James von Brecht and Matt Thorpe.

Tue 11/01/16, 01:30pm, Clark Hall 314**3D Object Geometry from Single Image***Xiaowei Zhou, University of Pennsylvania*

**Abstract:** The past few years have witnessed remarkable advancements in 2D image understanding driven by deep learning. But when it comes to scenarios involving interactions between humans, robots and the world, we need to extract not only 2D visual information but also 3D geometry of the scene from images. This talk focuses on our recent progress in recovering 3D object geometry, such as 3D structure and pose of rigid and articulated objects, from monocular imagery by leveraging data-driven representations learned from both 2D and 3D data. In particular, I will first introduce an approach to 3D human pose reconstruction based on the sparse representation of 3D human poses and CNN-based 2D pose predictions. I will discuss how to jointly optimize structural and viewpoint parameters with convex programming and how to account for the uncertainties in 2D pose predictions with an EM algorithm. Next, I will show that, using semantic keypoints and CAD models, it is feasible to estimate the 6-DoF pose of a rigid object from a cluttered image with a precision allowing a robot to grasp the object. Finally, I will introduce how to build object-class models from images of different instances with a multi-image matching algorithm that optimizes the cycle consistency of feature correspondences.

**Bio:** Xiaowei Zhou is a Postdoctoral Researcher in the Computer and Information Science Department at University of Pennsylvania. His research interests are in computer vision and robotics with a focus on object pose estimation, shape reconstruction, human pose estimation and data association. His current work attempts to combine 3D geometry, deep learning, optimization and statistical approaches to extract both semantic and geometric information about 3D scenes from visual data. Xiaowei Zhou obtained his Bachelor’s degree in Optical Engineering from Zhejiang University, 2008, and his PhD degree in Electronic and Computer Engineering from The Hong Kong University of Science and Technology, 2013.

(an ML talk in the CIS Speaker Series)

Tue 11/01/16, 11:00am, Clark Hall 110**Change Point Estimation of Brain Shape Data in Relation with Alzheimer’s Disease***Laurent Younes, Johns Hopkins University*

**Abstract:** The manifestation of an event, such as the onset of a disease, is not always immediate and often requires some time for its repercussions to become observable. Slowly progressing diseases, and in particular neuro-degenerative disorders such as Alzheimer’s disease (AD), fall into this category. The manifestation of such diseases is related to the onset of cognitive or functional impairment and, at the time when this occurs, the disease may have already been affecting the brain anatomically and functionally for a considerable time. We consider a statistical two-phase regression model in which the change point of a disease biomarker is measured relative to another point in time, such as the manifestation of the disease, which is subject to right-censoring (i.e., possibly unobserved over the entire course of the study). We develop point estimation methods for this model, based on maximum likelihood, and bootstrap validation methods. The effectiveness of our approach is illustrated by numerical simulations, and by the estimation of a change point for atrophy in the context of Alzheimer’s disease, wherein it is related to the cognitive manifestation of the disease. This work is a collaboration with Marilyn Albert, Xiaoying Tang and Michael Miller, and was partially supported by the NIH.

Thu 10/27/16, 01:30pm, Whitehead 304**Adaptive Contrast Weighted Learning and Tree-based Reinforcement Learning for Multi-Stage Multi-Treatment Decision-Making***Lu Wang, University of Michigan*

**Abstract:** Dynamic treatment regimes (DTRs) are sequential decision rules that focus simultaneously on treatment individualization and adaptation over time. We develop robust and flexible semiparametric and machine learning methods for estimating optimal DTRs. In this talk, we present a dynamic statistical learning method, adaptive contrast weighted learning (ACWL), which combines doubly robust semiparametric regression estimators with flexible machine learning methods. ACWL can handle multiple treatments at each stage and does not require prespecifying candidate DTRs. At each stage, we develop robust semiparametric regression-based contrasts with the adaptation of treatment effect ordering for each patient, and the adaptive contrasts simplify the problem of optimization with multiple treatment comparisons to a weighted classification problem that can be solved with existing machine learning techniques. We further develop a tree-based reinforcement learning (T-RL) method to directly estimate optimal DTRs in a multi-stage multi-treatment setting. At each stage, T-RL builds an unsupervised decision tree that maintains the nature of batch-mode reinforcement learning. Unlike ACWL, T-RL handles the optimization problem with multiple treatment comparisons directly through the purity measure constructed with augmented inverse probability weighted estimators. By combining robust semiparametric regression with flexible tree-based learning, T-RL is robust, efficient and easy to interpret for the identification of optimal DTRs. However, ACWL seems more robust to tree-type misspecification than T-RL when the true optimal DTR is non-tree-type. We illustrate the performances of both methods in simulations and case studies.

**Bio:** Lu Wang is Associate Professor of Biostatistics at University of Michigan. She received her Ph.D in Biostatistics from Harvard University in 2008 and joined the faculty at the University of Michigan in the same year. Dr. Wang’s research focuses on statistical methods for evaluating dynamic treatment regimes, personalized health care, missing data analysis, nonparametric and semiparametric regressions, and longitudinal (correlated/clustered) data analysis. Dr. Wang received the University of Michigan Injury Center Research Award in 2013, the Elizabeth C. Crosby Research Award in 2014, and the John G Searle Assistant Professorship in 2015. Many research works with her doctoral students have received prestigious awards from ENAR, JSM, and the Society for Clinical Trials. Besides methodology research, she has also been collaborating with many investigators at M.D. Anderson Cancer Center, University of Michigan Medical School, and Harvard School of Public Health. Dr. Wang has served as Associate Editor for Biometrics since 2013.

Wed 10/26/16, 01:30pm, Clark Hall 314**My Adventures with Bayes: In Search of Optimal Solutions in Machine Learning, Computer Vision and Beyond***Aleix M. Martinez, Ohio State University*

**Abstract:** The Bayes criterion is generally regarded as the holy grail in classification because, for known distributions, it leads to the smallest possible classification error. Unfortunately, the Bayes classification boundary is generally nonlinear and its associated error can only be calculated under unrealistic assumptions. In this talk, we will show how these obstacles can be readily and efficiently averted yielding Bayes optimal algorithms in machine learning, statistics, computer vision and other areas of scientific inquiry. In this journey, we will extend the notion of homoscedasticity (meaning of the same variance) to spherical-homoscedasticity (meaning of the same variance up to a rotation) and show how this allows us to generalize the Bayes criterion under more realistic assumptions. This will lead to a new concept of kernel mappings with applications in classification (machine learning), shape analysis (statistics), structure from motion (computer vision), brain alignment (neuroscience), and others. We will then define other optimization criteria where Bayes cannot be readily applied and define the use of kernels in labeled graphs.

(an ML talk in the CIS Speaker Series)

Wed 10/26/16, 12:00pm, Malone 228**Privacy, Information and Generalization***Adam Smith, Penn State*

**Abstract:** Consider an agency holding a large database of sensitive personal information (medical records, census survey answers, web search records, or genetic data, for example). The agency would like to discover and publicly release global characteristics of the data (say, to inform policy or business decisions) while protecting the privacy of individuals’ records. I will begin by discussing what makes this problem difficult, and exhibit some of the nontrivial issues that plague simple attempts at anonymization and aggregation. Motivated by this, I will present differential privacy, a rigorous definition of privacy in statistical databases that has received significant attention. In the second part of the talk, I will explain how differential privacy is connected to a seemingly different problem: “adaptive data analysis”, the practice by which insights gathered from data are used to inform further analysis of the same data sets. This is increasingly common in scientific research, in which data sets are shared and re-used across multiple studies. Classical statistical theory assumes that the analysis to be run is selected independently of the data. This assumption breaks down when data are re-used; the resulting dependencies can significantly bias the analyses’ outcome. I’ll show how limiting the information revealed about a data set during analysis allows one to control such bias, and why differentially private analyses provide a particularly attractive tool for limiting information. Based on several papers, including recent joint works with R. Bassily, K. Nissim, U. Stemmer, T. Steinke and J. Ullman (STOC 2016) and R. Rogers, A. Roth and O. Thakkar (FOCS 2016).
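
For readers unfamiliar with differential privacy, a standard textbook instance is the Laplace mechanism for a counting query. The sketch below is an illustration under stated assumptions rather than material from the talk: the names `laplace_noise` and `private_count`, the example data, and the value of `epsilon` are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Release a noisy count of the records that satisfy `predicate`.

    Adding or removing one record changes the true count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy for this single query.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Illustrative data (an assumption): ages 0..99, query "how many are 50 or older".
ages = list(range(100))
rng = random.Random(0)
noisy_answer = private_count(ages, lambda age: age >= 50, epsilon=0.5, rng=rng)
print(noisy_answer)
```

A smaller `epsilon` means more noise and stronger privacy. Answering several queries on the same data consumes privacy budget for each one, which this single-query sketch does not track.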

**Bio:** Adam Smith is a professor of Computer Science and Engineering at Penn State. His research interests lie in data privacy and cryptography, and their connections to machine learning, statistics, information theory, and quantum computing. He received his Ph.D. from MIT in 2004 and has held visiting positions at the Weizmann Institute of Science, UCLA, Boston University and Harvard. In 2009, he received a Presidential Early Career Award for Scientists and Engineers (PECASE). In 2016, he received the Theory of Cryptography Test of Time award, jointly with C. Dwork, F. McSherry and K. Nissim.

(an ML talk in the CS Speaker Series)

Fri 10/21/16, 01:30pm, Clark Hall 314**Segmentation and Tracking in Bioimage Analysis, and the Discrete Optimization Problems they Engender***Fred Hamprecht, University of Heidelberg*

**Abstract:** In connectomics, we wish to trace each and every neurite in massive 3D images; and in developmental biology, we hope to track the position and fate of each and every cell. Both problems remain difficult, in spite of ever-improving data quality. In this talk, I will present both our modeling efforts and progress in the associated combinatorial optimization problems. In connectomics, I will show that the NP-hard but principled multicut / correlation clustering problem affords a great boost in accuracy over deep neural networks alone. I will also introduce heuristic solvers that reach good minima in a fraction of the time required by full integer linear program solvers. On the tracking side, I will advocate a model for *joint* segmentation and tracking. If cells can divide, this model also is NP-hard. Again, a specialized heuristic solver can come to the rescue in large instances. The connectomics pipeline currently tops the leaderboard of the blind ISBI 2012 and the CREMI challenges, and the tracking work performed best on at least one dataset of the last Cell Tracking Challenge.

**Bio:** Fred develops and applies machine learning methods to solve challenging problems in bioimage analysis. He is particularly interested in active or weakly supervised learning. His group puts great emphasis on the development of user-friendly software (such as the http://ilastik.org framework) that can be used by experimentalists. Fred studied at ETH Zurich. He is a Professor at the University of Heidelberg, a co-founder of the Heidelberg Collaboratory for Image Processing (HCI), a Visitor to the HHMI Janelia Farm Research Campus, and still thinks of science as the greatest job on earth. https://hciweb.iwr.uni-heidelberg.de/people/fhamprec

(an ML talk in the CIS Speaker Series)

Mon 10/10/16, 12:15pm, School of Public Health, Room W2008**Improving the Robustness of Doubly Robust Estimators in Missing Data Analysis***Peisong Han, University of Waterloo*

**Abstract:** Estimators that are robust against model misspecifications are highly desired. In missing data analysis, doubly robust estimators are consistent if either the model for selection probability or the model for data distribution is correctly specified. We will present a method that exhibits improved robustness. This method can simultaneously account for multiple models for both selection probability and data distribution. The resulting estimators are consistent if any one model is correctly specified. In addition, these estimators achieve maximal possible efficiency when both quantities are correctly modeled and are not sensitive to near-zero values of estimated selection probability. This new method is based on the calibration idea from the survey sampling literature and has a strong connection to empirical likelihood.
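
For readers unfamiliar with double robustness, the following minimal sketch shows the standard augmented inverse-probability-weighted (AIPW) estimator of a mean when outcomes are missing at random. It is not the calibration-based method of the talk. The simulated data, the names `aipw_mean` and `simulate`, and the deliberately wrong working models are illustrative assumptions.

```python
import math
import random


def aipw_mean(y, observed, propensity, outcome_pred):
    """AIPW estimate of E[Y] from partially observed outcomes.

    propensity[i]   : modelled probability that y[i] is observed
    outcome_pred[i] : modelled conditional mean of y[i] given covariates
    """
    total = 0.0
    for yi, ri, pi, mi in zip(y, observed, propensity, outcome_pred):
        r = 1.0 if ri else 0.0
        ipw_term = yi / pi if ri else 0.0   # y[i] is used only when observed
        total += ipw_term - (r - pi) / pi * mi
    return total / len(y)


def simulate(n=20000, seed=0):
    """Toy data (assumed): Y = 2 + 3X + noise with X ~ U(0, 1), so E[Y] = 3.5."""
    rng = random.Random(seed)
    xs, ys, obs = [], [], []
    for _ in range(n):
        x = rng.random()
        p = 1.0 / (1.0 + math.exp(-(-0.5 + 2.0 * x)))  # true selection probability
        xs.append(x)
        ys.append(2.0 + 3.0 * x + rng.gauss(0.0, 1.0))
        obs.append(rng.random() < p)
    return xs, ys, obs


xs, ys, obs = simulate()
true_ps = [1.0 / (1.0 + math.exp(-(-0.5 + 2.0 * x))) for x in xs]
true_m = [2.0 + 3.0 * x for x in xs]
wrong_ps = [0.5] * len(xs)   # misspecified selection model
wrong_m = [0.0] * len(xs)    # misspecified outcome model

est_ps_only = aipw_mean(ys, obs, true_ps, wrong_m)   # only the selection model is correct
est_m_only = aipw_mean(ys, obs, wrong_ps, true_m)    # only the outcome model is correct
est_neither = aipw_mean(ys, obs, wrong_ps, wrong_m)  # both models are wrong
print(est_ps_only, est_m_only, est_neither)
```

In this toy setup the first two estimates should land near 3.5, while the third is visibly biased; that gap is what "doubly robust" refers to.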

(an ML talk in the Biostatistics Speaker Series)

Fri 10/07/16, 12:00pm, Hackerman B17**Procedural Language and Knowledge***Yejin Choi, University of Washington*

**Abstract:** Various types of how-to-knowledge are encoded in natural language instructions: from setting up a tent, to preparing a dish for dinner, and to executing biology lab experiments. These types of instructions are based on procedural language, which poses unique challenges. For example, verbal arguments are commonly elided when they can be inferred from context, e.g., “bake for 30 minutes”, not specifying bake what and where. Entities frequently merge and split, e.g., “vinegar” and “oil” merging into “dressing”, creating challenges to reference resolution. And disambiguation often requires world knowledge, e.g., the implicit location argument of “stir frying” is on “stove”. In this talk, I will present our recent approaches to interpreting and composing cooking recipes that aim to address these challenges. In the first part of the talk, I will present an unsupervised approach to interpreting recipes as action graphs, which define what actions should be performed on which objects and in what order. Our work demonstrates that it is possible to recover action graphs without having access to gold labels, virtual environments or simulations. The key insight is to rely on the redundancy across different variations of similar instructions that provides the learning bias to infer various types of background knowledge, such as the typical sequence of actions applied to an ingredient, or how a combination of ingredients (e.g., “flour”, “milk”, “eggs”) becomes a new entity (e.g., “wet mixture”). In the second part of the talk, I will present an approach to composing new recipes given a target dish name and a set of ingredients. The key challenge is to maintain global coherence while generating a goal-oriented text. We propose a Neural Checklist Model that attains global coherence by storing and updating a checklist of the agenda (e.g., an ingredient list) with paired attention mechanisms for tracking what has been already mentioned and what has yet to be introduced. This model also achieves strong performance on dialogue system response generation. I will conclude the talk by discussing the challenges in modeling procedural language and acquiring the necessary background knowledge, pointing to avenues for future research.

**Bio:** Yejin Choi is an assistant professor in the Computer Science & Engineering Department of the University of Washington. Her recent research focuses on language grounding, integrating language and vision, and modeling nonliteral meaning in text. She was among the IEEE’s AI Top 10 to Watch in 2015 and a co-recipient of the Marr Prize at ICCV 2013. Her work on detecting deceptive reviews, predicting literary success, and learning to interpret connotation has been featured by numerous media outlets including NBC News for New York, NPR Radio, New York Times, and Bloomberg Business Week. She received her Ph.D. in Computer Science at Cornell University.

(an ML talk in the CLSP Speaker Series)

Thu 10/06/16, 01:30pm, Whitehead 304**Stochastic Search Methods for Simulation Optimization***Enlu Zhou, Georgia Institute of Technology*

**Abstract:** A variety of systems arising in finance, engineering design, and manufacturing require the use of optimization techniques to improve their performance. Due to the complexity and stochastic dynamics of such systems, their performance evaluation frequently requires computer simulation, which, however, often lacks the structure needed by classical optimization methods. We developed a gradient-based stochastic search approach, based on the idea of converting the original (structure-lacking) problem to a differentiable optimization problem on the parameter space of a sampling distribution that guides the search. A two-timescale updating scheme is further studied and incorporated to improve the algorithm efficiency. Convergence properties of our approach are established through techniques from stochastic approximation, and the performance of our algorithms is illustrated in comparison with some state-of-the-art simulation optimization methods. This is joint work with Jiaqiao Hu (Stony Brook University) and Shalabh Bhatnagar (Indian Institute of Science).

**Bio:** Dr. Zhou has been active in the theory, methods, and applications of simulation optimization and stochastic control. She currently works on the development of efficient algorithms for (i) optimizing and predicting performance of complex systems that are described by stochastic simulation models, and (ii) solving dynamic decision-making problems under uncertainty and driven by data. Her research is at the interface of simulation, control, and optimization. The application areas of her research include financial engineering, inventory control, and systems biology.

(an ML talk in the AMS Speaker Series)

Mon 10/03/16, 12:15pm, BSPH Room W2008**Real-Time Prediction of Infectious Disease Outbreaks***Nicholas Reich, University of Massachusetts, Amherst*

**Abstract:** Creating statistical models that generate accurate predictions of infectious disease incidence over multiple time points is a challenging problem. Forecasts of infectious disease outbreaks can be used to inform targeted intervention and prevention strategies such as increased healthcare staffing or vector control measures. I will present a new approach to predicting infectious disease that uses kernel conditional density estimation (KCDE) and copulas. The method was a top performer in the 2015-2016 FluSight influenza forecasting challenge run in real-time by the CDC. The method works by obtaining predictive distributions for incidence in individual weeks using KCDE and then tying those distributions together into joint distributions using copulas. This strategy enables us to create predictions for the timing of and incidence in the peak week of a given season. Our implementation of KCDE also incorporates two novel kernel components: a periodic component that captures seasonality in disease incidence, and a component that allows for a full parameterization of the bandwidth matrix with discrete variables. We apply the method to predicting dengue fever and influenza. Overall, KCDE compares favorably to other standard prediction methods including a seasonal autoregressive integrated moving average (SARIMA) model and a previously published generalized linear model for infectious disease incidence known as HHH4. Finally, I will discuss ongoing work that integrates this method into an ensemble model averaging framework using model weights that vary based on the time of year predictions are made.

(an ML talk in the Biostatistics Speaker Series)

Wed 09/28/16, 03:00pm, Krieger 309**Global Optimality in Matrix and Tensor Factorization, Deep Learning, and Beyond***Rene Vidal, Johns Hopkins University*

**Abstract:** Matrix, tensor, and other factorization techniques are used in many applications and have enjoyed significant empirical success in many fields. However, common to a vast majority of these problems is the significant disadvantage that the associated optimization problems are typically non-convex due to a multilinear form or other convexity destroying transformation. Building on ideas from convex relaxations of matrix factorizations, in this talk I will present a very general framework which allows for the analysis of a wide range of non-convex factorization problems – including matrix factorization, tensor factorization, and deep neural network training formulations. In particular, I will present sufficient conditions under which a local minimum of the non-convex optimization problem is a global minimum and show that if the size of the factorized variables is large enough then from any initialization it is possible to find a global minimizer using a local descent algorithm. Related paper: https://arxiv.org/abs/1506.07540

Tue 09/27/16, 12:00pm, Bloomberg School of Public Health E3609 (Genome Cafe)**Bayesian Nonparametric Models for Causal Inference with Missing-At-Random Covariates***Jason Roy, University of Pennsylvania*

**Abstract:** We propose a general Bayesian non-parametric approach to causal inference in the point treatment setting. The joint distribution of the observed data (outcome, treatment, and confounders) is modeled using an enriched Dirichlet process. The combination of the observed data model and causal assumptions allows us to identify any type of causal effect – differences, ratios, or quantile effects, either marginally or for subpopulations of interest. The proposed BNP model is well-suited for causal inference problems, as it does not require parametric assumptions about the distribution of confounders and naturally leads to a computationally efficient Gibbs sampling algorithm. By flexibly modeling the joint distribution, we are also able to impute values for missing covariates within the algorithm, obviating the need to create separate imputed data sets. Imputing data within the algorithm also has the advantage of guaranteeing congeniality between the imputation model and analysis model, and, because we use a BNP approach, parametric models are avoided for imputation. The performance of the method is assessed using simulation studies. The method is applied to data from a cohort study of HIV / HCV co-infected patients.

Thu 09/22/16, 01:30pm, Whitehead 304**High-Dimensional Analysis of Stochastic Algorithms for Convex and Nonconvex Optimization: Limiting Dynamics and Phase Transitions***Yue Lu, Harvard University*

**Abstract:** We consider efficient iterative methods (e.g., stochastic gradient descent, randomized Kaczmarz algorithms, iterative coordinate descent) for solving large-scale optimization problems, whether convex or nonconvex. A flurry of recent work has focused on establishing their theoretical performance guarantees. This intense interest is spurred on by the remarkably impressive empirical performance achieved by these low-complexity and memory-efficient methods. In this talk, we will present a framework for analyzing the exact dynamics of these methods in the high-dimensional limit. For concreteness, we consider two prototypical problems: regularized linear regression (e.g., LASSO) and sparse principal component analysis. For each case, we show that the time-varying estimates given by the algorithms will converge weakly to a deterministic “limiting process” in the high-dimensional (scaling and mean-field) limit. Moreover, this limiting process can be characterized as the unique solution of a nonlinear PDE, and it provides exact information regarding the asymptotic performance of the algorithms. For example, performance metrics such as the MSE, the cosine similarity and the misclassification rate in sparse support recovery can all be obtained by examining the deterministic limiting process. A steady-state analysis of the nonlinear PDE also reveals interesting phase transition phenomena related to the performance of the algorithms. Although our analysis is asymptotic in nature, numerical simulations show that the theoretical predictions are accurate for moderate signal dimensions. What makes our analysis tractable is the notion of exchangeability, a fundamental property of symmetry that is inherent in many of the optimization problems encountered in signal processing and machine learning.

**Bio:** Yue M. Lu was born in Shanghai. After finishing undergraduate studies at Shanghai Jiao Tong University, he attended the University of Illinois at Urbana-Champaign, where he received the M.Sc. degree in mathematics and the Ph.D. degree in electrical engineering, both in 2007. He was a Research Assistant at the University of Illinois at Urbana-Champaign, and has worked for Microsoft Research Asia, Beijing, and Siemens Corporate Research, Princeton, NJ. Following his work as a postdoctoral researcher at the Audiovisual Communications Laboratory at Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland, he joined Harvard University in 2010, where he is currently an Associate Professor of Electrical Engineering at the John A. Paulson School of Engineering and Applied Sciences. He received the Most Innovative Paper Award (with Minh N. Do) of IEEE International Conference on Image Processing (ICIP) in 2006, the Best Student Paper Award of IEEE ICIP in 2007, and the Best Student Presentation Award at the 31st SIAM SEAS Conference in 2007. Student papers supervised and coauthored by him won the Best Student Paper Award (with Ivan Dokmanic and Martin Vetterli) of IEEE International Conference on Acoustics, Speech and Signal Processing in 2011 and the Best Student Paper Award (with Ameya Agaskar and Chuang Wang) of IEEE Global Conference on Signal and Information Processing (GlobalSIP) in 2014. He has been an Associate Editor of the IEEE Transactions on Image Processing since 2014, an Elected Member of the IEEE Image, Video, and Multidimensional Signal Processing Technical Committee since 2015, and an Elected Member of the IEEE Signal Processing Theory and Methods Technical Committee since 2016. He received the ECE Illinois Young Alumni Achievement Award in 2015.

(an ML talk in the AMS Speaker Series)

Tue 09/20/16, 02:00pm, Gilman 132**From Molecular Dynamics to Large Scale Inference***Ben Leimkuhler, University of Edinburgh*

**Abstract:** Molecular models and data analytics problems give rise to very large systems of stochastic differential equations (SDEs) whose paths are designed to ergodically sample multimodal probability distributions. An important challenge for the numerical analyst (or the data scientist, for that matter) is the design of numerical procedures to generate these paths. One of the interesting ideas is to construct stochastic numerical methods with close attention to the error in the invariant measure. Another is to redesign the underlying stochastic dynamics to reduce bias or locally transform variables to enhance sampling efficiency. I will illustrate these ideas with various examples, including a geodesic integrator for constrained Langevin dynamics and an ensemble sampling strategy for distributed inference.

Thu 09/15/16, 01:30pm, Maryland Hall 110**Stochastic Newton Methods for Machine Learning***Jorge Nocedal, Northwestern University*

**Abstract:** Optimization methods play a crucial role in supervised learning where they are employed to solve problems in very high dimensional parameter spaces. The optimization problems are inherently stochastic and often involve huge data sets. There has recently been much interest in sub-sampled Newton methods for these types of applications. The methods use approximations to the gradient and Hessian in a way that strikes a balance between computational effort and speed of convergence. We provide a review and analysis of sub-sampled Newton methods, which include Newton-sketch and non-uniform subsampling techniques, and illustrate their effectiveness on some large-scale machine learning applications.
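
As a rough illustration of the sub-sampling idea, the sketch below runs Newton iterations on a toy two-parameter logistic regression, using the full gradient but a Hessian estimated from a uniform random subsample. It is one simple variant under stated assumptions, not Newton-sketch or the non-uniform schemes mentioned in the abstract; the data generator, the regularization weight `lam`, and all function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_data(n=2000, seed=1):
    """Toy data (assumed): y ~ Bernoulli(sigmoid(1.5 * x - 0.5)), x ~ N(0, 1)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1.0 if rng.random() < sigmoid(1.5 * x - 0.5) else 0.0
        data.append((x, y))
    return data


def gradient(w, data, lam):
    """Gradient of the L2-regularized mean logistic loss; w = [slope, intercept]."""
    g = [lam * w[0], lam * w[1]]
    for x, y in data:
        err = sigmoid(w[0] * x + w[1]) - y
        g[0] += err * x / len(data)
        g[1] += err / len(data)
    return g


def hessian(w, data, lam):
    """Hessian of the same loss as the entries (h00, h01, h11) of a symmetric 2x2 matrix."""
    h00, h01, h11 = lam, 0.0, lam
    for x, _ in data:
        p = sigmoid(w[0] * x + w[1])
        s = p * (1.0 - p) / len(data)
        h00 += s * x * x
        h01 += s * x
        h11 += s
    return h00, h01, h11


def subsampled_newton(data, lam=0.01, hess_sample=200, iters=20, seed=0):
    rng = random.Random(seed)
    w = [0.0, 0.0]
    for _ in range(iters):
        g = gradient(w, data, lam)                # gradient from all data
        batch = rng.sample(data, hess_sample)     # Hessian from a subsample only
        h00, h01, h11 = hessian(w, batch, lam)
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system H d = g by Cramer's rule, then take the Newton step
        d0 = (h11 * g[0] - h01 * g[1]) / det
        d1 = (h00 * g[1] - h01 * g[0]) / det
        w = [w[0] - d0, w[1] - d1]
    return w


data = make_data()
w = subsampled_newton(data)
print(w, gradient(w, data, 0.01))
```

Because the Hessian is only approximate, each step is cheaper than an exact Newton step; the price is that convergence near the optimum is typically linear rather than quadratic.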

(an ML talk in the AMS Speaker Series)

Mon 09/12/16, 12:15pm, School of Public Health, Room W2008**Online Estimation of Optimal Treatment Allocations for Control of an Emerging Infectious Disease***Eric Laber, North Carolina State University Department of Statistics*

**Abstract:** Emerging infectious diseases are responsible for high morbidity and mortality and for economic damage to affected countries, and they are a major vulnerability for global stability. Technological advances have made it possible to collect, curate, and access large amounts of data on the progression of an infectious disease. We derive a framework for using this data in real time to inform disease management. We formalize a treatment allocation strategy as a sequence of functions, one per treatment period, that map up-to-date information on the spread of an infectious disease to a subset of locations for treatment. An optimal allocation strategy optimizes some cumulative outcome, e.g., the number of uninfected locations, the geographic footprint of the disease, or the cost of the epidemic. Estimation of an optimal allocation strategy for an emerging infectious disease is challenging because spatial proximity induces interference among locations, because the number of possible allocations is exponential in the number of locations, and because disease dynamics and intervention effectiveness are unknown at outbreak. We derive a Bayesian online estimator of the optimal allocation strategy that combines simulation-optimization with Thompson sampling. The proposed estimator performs favorably in simulation experiments. This work is motivated by and illustrated using data on the spread of white-nose syndrome, a highly fatal infectious disease devastating bat populations in North America.
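
Thompson sampling, one ingredient of the estimator above, can be sketched in its simplest form on a Bernoulli bandit. This is a generic textbook version, not the talk's spatial allocation estimator: the "arms" merely stand in for candidate treatment choices, and the success rates, the Beta(1, 1) priors, and the function name `thompson_sampling` are assumptions made for illustration.

```python
import random


def thompson_sampling(true_rates, rounds=2000, seed=0):
    """Return how often each arm was chosen over `rounds` sequential decisions."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [1] * n_arms   # Beta(1, 1) uniform prior on each arm's success rate
    failures = [1] * n_arms
    pulls = [0] * n_arms
    for _ in range(rounds):
        # Draw one plausible success rate per arm from its current posterior
        draws = [rng.betavariate(successes[a], failures[a]) for a in range(n_arms)]
        # Act as if the draw were the truth: choose the arm with the best draw
        arm = max(range(n_arms), key=lambda a: draws[a])
        reward = rng.random() < true_rates[arm]   # simulated outcome
        pulls[arm] += 1
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


pulls = thompson_sampling([0.2, 0.5, 0.8])
print(pulls)
```

Sampling from the posterior, rather than always acting on the current best estimate, is what lets the procedure keep exploring uncertain options while concentrating on the ones that look best.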

(an ML talk in the Biostatistics Speaker Series)

Tue 09/06/16, 11:00am, Clark Hall 110**Mining Big Data for Molecular Marker Detection***Su-In Lee, University of Washington*

**Abstract:** The repertoire of drugs for patients with cancer is rapidly expanding; however, cancers that appear pathologically similar often respond differently to the same drug regimens. Methods to better match patients to specific drugs are in high demand. There is a fair amount of data on molecular profiles from patients with cancer. The most important step necessary to realize the ultimate goal is to identify molecular markers in these data that predict the response to each of hundreds of chemotherapy drugs. However, due to the high dimensionality (i.e., the number of variables is much greater than the number of samples) along with potential biological or experimental confounders, it is an open challenge to identify robust biomarkers that are replicated across different studies. In this talk, I will present two distinct machine learning techniques to resolve these challenges. These methods learn the low-dimensional features that are likely to represent important molecular events in the disease process in an unsupervised fashion, based on molecular profiles from multiple populations of patients with a specific cancer type. I will present two applications of these two methods – acute myeloid leukemia (AML) and ovarian cancer. When the first method was applied to AML data in collaboration with UW Hematology and UW’s Center for Cancer Innovation, a novel molecular marker for topoisomerase inhibitors, widely used chemotherapy drugs in AML treatment, was revealed. The other method applied to ovarian cancer data led to a potential molecular driver for tumor-associated stroma, in collaboration with UW Pathology and UW Genome Sciences. Our methods are general computational frameworks and can be applied to many other diseases.

**Bio:** Su-In Lee is an Assistant Professor in the Departments of Computer Science & Engineering and Genome Sciences at the University of Washington. She received her Ph.D. degree in Electrical Engineering from Stanford University in 2009. Before joining the UW in 2010, she was a Visiting Assistant Professor in the Computational Biology Department at Carnegie Mellon University. Her interest is in developing advanced machine learning algorithms to analyze high-throughput data to 1) discover molecular mechanisms of diseases, 2) identify therapeutic targets, and 3) develop personalized treatment plans given an individual’s molecular profile. She has been named an American Cancer Society Research Scholar and received the NSF CAREER award. Her lab is currently funded by the American Cancer Society, the National Institutes of Health, the National Science Foundation, the Institute of Translational Health Sciences and the Solid Tumor Translational Research.

Wed 05/11/16, 01:30pm, Clark Hall 314**Big Data in Behavioral Medicine***James M. Rehg, Georgia Institute of Technology*

**Abstract:** The explosion of health-related data in the form of electronic health records and genomics has captured the attention of both the machine learning and medical informatics communities. In this talk, I will describe an emerging opportunity to bring data analytics to bear on the behavioral dimension of health as well. As we move towards a preventive and anticipatory approach to health and medicine, understanding the genesis of adverse health-related behaviors (such as smoking and unhealthy eating) and developing more effective interventions for behavior change become critical challenges. Advances in mobile sensing, from classical activity monitoring to the recent advent of wearable cameras, provide new opportunities to continuously measure behaviors under naturalistic conditions and construct novel predictive models for adverse behavioral outcomes. Behavioral data is inherently multimodal and time-varying, with complex dynamics over multiple temporal scales, and poses several interesting machine learning challenges. This talk will provide an overview of this emerging research area and highlight several of our on-going projects. Through the NIH-funded MD2K center (http://md2k.org), we are collaborating with a multi-university team to construct an open source platform for sensing health-related behaviors in the field and triggering mobile interventions. I will provide an overview of that effort and the health challenges in smoking cessation and heart failure that we are addressing. I will describe some recent work on efficient parameter learning for continuous-time hidden Markov models (CT-HMM), which can support trajectory modeling and prediction for sequences of event data from clinical populations. I will also present work on developing novel sensors for social behaviors in the context of treating children with autism. In particular, I will describe a recent method for automatically measuring eye contact during face-to-face social interactions using wearable cameras, and describe its clinical applications. This is joint work with Drs. Agata Rozga, Yu-Ying Liu, and Fuxin Li, and PhD students Alexander Moreno, Yin Li, and Eunji Chong.

**Bio:** James M. Rehg (pronounced “ray”) is a Professor in the School of Interactive Computing at the Georgia Institute of Technology, where he is Director of the Center for Behavioral Imaging and co-Director of the Computational Perception Lab (CPL). He received his Ph.D. from CMU in 1995 and worked at the Cambridge Research Lab of DEC (and then Compaq) from 1995-2001, where he managed the computer vision research group. He received an NSF CAREER award in 2001 and a Raytheon Faculty Fellowship from Georgia Tech in 2005. He and his students have received best student paper awards at ICML 2005, BMVC 2010, Mobihealth 2014, and Face and Gesture 2015, and a 2013 Method of the Year Award from the journal Nature Methods. Dr. Rehg serves on the Editorial Board of the Intl. J. of Computer Vision, and he served as the Program co-Chair for ACCV 2012 and General co-Chair for CVPR 2009, and will serve as Program co-Chair for CVPR 2017. He has authored more than 100 peer-reviewed scientific papers and holds 25 issued US patents. His research interests include computer vision, machine learning, robot perception and mobile health. Dr. Rehg was the lead PI on an NSF Expedition to develop the science and technology of Behavioral Imaging, the measurement and analysis of social and communicative behavior using multi-modal sensing, with applications to developmental disorders such as autism. He is currently the Deputy Director of the NIH Center of Excellence on Mobile Sensor Data-to-Knowledge (MD2K), which is developing novel on-body sensing and predictive analytics for improving health outcomes.

(an ML talk in the CIS Speaker Series)

Tue 05/10/16, 04:30pm, Clark 314**How to break the unrealistic symmetries of data analysis on high-dimensional manifolds: Tactics and prototypes from morphometrics***Fred Bookstein, University of Washington*

**Abstract:** To balance realism and formalism in quantitative scientific contexts, empirical evidence about actual processes must be allowed to break the symmetries of otherwise elegant mathematical foundations. When advanced multivariate analyses are applied to data living in high-dimensional manifolds, for instance, there is more to breaking symmetry than just contradicting the i.i.d. model for specimens: one can also consider restructuring the space of the variables themselves. Manifolds offer two kinds of geometrical anisotropy: differences of covariance structure from point to point over the manifold, and differences of meaning from direction to direction within the tangent space at a point. Today’s geometric morphometrics offers examples of broken symmetry of both these types. My talk will sketch several of these possibilities, including one that particularly intrigues me these days: the imposition of a *scaling dimension* that models directional shape variance as loglinear in directional bending energy (with a negative slope). Prior biological knowledge is thereby encoded in an interpretable quadratic form intended to dominate information from covariance structures. Such a stratagem might take its place along with gravity, texture, energetics, and the other approaches that, taken together, inject much-needed anisotropies into the applied mathematics of image-driven studies of living organisms.

(an ML talk in the CIS Speaker Series)

Tue 05/10/16, 01:30pm, Clark 314**Internal Representations in Deep Networks for Object Detection***Alan Yuille, JHU*

**Abstract:** Deep Networks are very successful for a range of visual tasks. But they remain “black boxes” whose internal representations are hard to understand. This talk describes recent work where we study the internal activity of deep networks when applied to objects such as cars and airplanes. We show that the deep networks form internal representations, which we call visual concepts, and which are activated by image patches which are visually very similar. To test these visual concepts we annotated these objects into semantic parts so that the coverage is complete, in the sense that each instance of the object can be fully represented by semantic parts. Then we show that the visual concepts, within the deep network, correspond to semantic parts of the objects. This throws some light on the internal representations of deep nets and can also be used as unsupervised part discovery.

**Bio:** Alan Yuille was born in England and received a BA in Mathematics and a PhD in Theoretical Physics from Cambridge University. He has been a postdoc at U.T. Austin, MIT, and Harvard University. He has been a Professor at Harvard, UCLA, and Korea University and a senior researcher at the Smith-Kettlewell Eye Research Institute. He is currently a Bloomberg Distinguished Professor in Cognitive Science and Computer Science at Johns Hopkins University. His research interests include Computer and Biological Vision, Machine Learning, and Neural Modeling.

(an ML talk in the CIS Speaker Series)

Thu 04/28/16, 01:30pm, Mergenthaler 111**Unveiling the mysteries in spatial gene expression***Bin Yu, UC Berkeley*

**Abstract:** Genome-wide data reveal an intricate landscape where gene activities are highly differentiated across diverse spatial areas. These gene actions and interactions play a critical role in the development and function of both normal and abnormal tissues. As a result, understanding spatial heterogeneity of gene networks is key to developing treatments for human diseases. Despite the abundance of recent spatial gene expression data, extracting meaningful information remains a challenge for local gene interaction discoveries. In response, we have developed staNMF, a method that combines a powerful unsupervised learning algorithm, nonnegative matrix factorization (NMF), with a new stability criterion that selects the size of the dictionary. Using staNMF, we generate biologically meaningful Principal Patterns (PP), which provide a novel and concise representation of Drosophila embryonic spatial expression patterns that correspond to pre-organ areas of the developing embryo. Furthermore, we show how this new representation can be used to automatically predict manual annotations, categorize gene expression patterns, and reconstruct the local gap gene network with high accuracy. Finally, we discuss on-going CRISPR/Cas9 knock-out experiments on Drosophila to verify predicted local gene-gene interactions involving gap genes. Open-source software based on SPARK and Fiji is also being built. This talk is based on collaborative work of a multi-disciplinary team (co-lead Erwin Frise) from the Yu group (statistics) at UC Berkeley, the Celniker group (biology) at the Lawrence Berkeley National Lab (LBNL), and the Xu group (computer science) at Tsinghua University.

Wed 04/27/16, 01:30pm, Gilman 50**Movie Reconstruction from Brain Signals: “Mind Reading”***Bin Yu, UC Berkeley*

**Abstract:** In a thrilling breakthrough at the intersection of neuroscience and statistics, penalized Least Squares methods have been used to construct a “mind-reading” algorithm that reconstructs movies from fMRI brain signals. The story of this algorithm is a fascinating tale of the interdisciplinary research that led to the development of the system which was selected as one of Time Magazine’s 50 Best Inventions of 2011.

Wed 04/27/16, 12:00pm, Malone 107**Multiresolution Matrix Factorization***Risi Kondor, University of Chicago*

**Abstract:** The sheer size of today’s datasets dictates that learning algorithms compress or reduce their input data and/or make use of parallelism. Multiresolution Matrix Factorization (MMF) makes a connection between such computational strategies and some classical themes in Applied Mathematics, namely Multiresolution Analysis and Multigrid Methods. In particular, the similarity matrices appearing in data often have multiresolution structure, which can be exploited both for learning and to facilitate computation. In addition to the general MMF framework, I will present our new, high performance parallel MMF software library and show some results on matrix compression/sketching tasks (joint work with Nedelina Teneva, Pramod Mudrakarta and Vikas Garg).

Mon 04/25/16, 12:10pm, Room W3008, School of Public Health**Sparse CCA: Statistical and Computational Limits***Zongming Ma, Wharton School, University of Pennsylvania*

**Abstract:** This talk will consider the problem of sparse canonical correlation analysis (sparse CCA). The first part of the talk will focus on the statistical side. We will argue that sparse CCA is intrinsically different from the well-studied sparse PCA problem because of the presence of high-dimensional nuisance parameters, namely, the marginal covariance matrices. A somewhat surprising result we derived shows that the minimax rate of sparse CCA is nearly independent of the structure of the marginal covariance matrices. The second part of the talk will focus on the computational side. A two-stage algorithm is proposed to achieve the minimax rate adaptively under an extra sample size condition. We will further present a computational lower bound argument to show that the additional sample size condition is essentially necessary for any polynomial-time algorithm to work under a Planted Clique hypothesis. A novel reduction procedure is constructed to ensure that the lower bound is faithful to the model. A byproduct of the argument also provides the first computational lower bound for sparse PCA under the Gaussian spiked covariance model. This talk is based on joint work with Chao Gao, Zhao Ren and Harrison Zhou.

(an ML talk in the Biostatistics Speaker Series)

Mon 04/18/16, 12:00am, Room W3008, Bloomberg School of Public Health**Noise-Addition Methods and the False Selection Rate (FSR) Approach***Dennis Boos (Joint Work with Len Stefanski), Dept. of Statistics, North Carolina State University*

**Abstract:** The simulation-based method SIMEX (Cook and Stefanski, 1994, JASA), designed for analyzing measurement error models, showed that constructively adding noise to data can reveal properties of statistical methods thus enabling their tuning to achieve certain desirable properties. The strategy of adding “noise” to data can also be used for the purpose of studying and improving variable selection methods. In this talk I give an overview of this noise-addition variable selection methodology. In particular, we adapted the SIMEX idea by adding noise to the response variable in linear regression leading to a choice of the tuning parameter. This research then led us to adding noise in the form of additional explanatory variables uncorrelated with the response variable, and monitoring the rate (FSR) at which the phony additional variables enter the model for different tuning values. This led to a choice of tuning parameter for general regression contexts with a connection to adaptive false discovery methods.

(an ML talk in the Biostatistics Speaker Series)

Fri 04/15/16, 12:00pm, Hackerman B17**Sampling to Efficiently Train Bilingual Neural Network Language Models***Colin Cherry, National Research Council of Canada*

**Abstract:** The neural network joint model of translation (NNJM) is a language model that considers both source and target context to produce a powerful feature for statistical machine translation. However, its softmax top layer necessitates a sum over the entire output vocabulary, which results in very slow maximum likelihood (MLE) training. This has led some groups to train using Noise Contrastive Estimation (NCE), which sidesteps this sum by optimizing an alternate objective, aiming to differentiate true data points from sampled noise. We carry out the first direct comparison of MLE and NCE training objectives for the NNJM, showing that NCE is significantly outperformed by MLE on large-scale Arabic-English and Chinese-English translation tasks. We also show that this drop can be avoided by using a simple, translation-specific noise distribution that conditions on the source sentence.
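To make the contrast in the abstract concrete, here is a minimal sketch of the two training losses for a single example. It assumes a textbook NCE formulation with `k` noise samples drawn from a known noise distribution; the function names, the toy scores, and the uniform noise distribution are illustrative assumptions and are not taken from the talk.

```python
import math


def log_softmax_loss(scores, target):
    """MLE loss: the normalizer sums over every entry of the output vocabulary."""
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_z - scores[target]


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def nce_loss(scores, target, noise_ids, noise_probs):
    """NCE loss: a binary data-vs-noise classification over k + 1 words only.

    scores      -- unnormalized model scores, one per vocabulary word
    noise_ids   -- indices of the k sampled noise words (assumed given)
    noise_probs -- noise distribution q over the vocabulary (assumed known)
    """
    k = len(noise_ids)

    def logit(word):
        # Log-odds that `word` came from the data rather than from the noise.
        return scores[word] - math.log(k * noise_probs[word])

    loss = -log_sigmoid(logit(target))
    for word in noise_ids:
        loss -= log_sigmoid(-logit(word))
    return loss


# Toy vocabulary of four words with a uniform noise distribution.
toy_scores = [2.0, 0.5, -1.0, 0.0]
uniform_noise = [0.25] * 4
mle = log_softmax_loss(toy_scores, target=0)
nce = nce_loss(toy_scores, target=0, noise_ids=[1, 3], noise_probs=uniform_noise)
```

The MLE loss touches every score in the vocabulary, while the NCE loss only touches the target and the sampled noise words, which is where the training speed-up comes from. A source-conditioned noise distribution, as described in the abstract, would replace `uniform_noise` here.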

**Bio:** Colin Cherry is a Senior Research Officer at the National Research Council of Canada. Previously, he was a Researcher at Microsoft Research. He received his Ph.D. in Computing Science from the University of Alberta. His primary research area is machine translation, but he has also been known to venture into parsing, morphology and information extraction. He is currently secretary of the NAACL, and recently sat on the editorial board of Computational Linguistics.

(an ML talk in the CLSP Speaker Series)

Thu 04/14/16, 01:30pm, Whitehead 304**Scalable Bayesian Models of Interacting Time Series***Emily Fox, University of Washington*

**Abstract:** Data streams of increasing complexity and scale are being collected in a variety of fields ranging from neuroscience, genomics, and environmental monitoring to e-commerce. Modeling the intricate and possibly evolving relationships between the large collection of series can lead to increased predictive performance and domain-interpretable structures. For scalability, it is crucial to discover and exploit sparse dependencies between the data streams. Such representational structures for independent data sources have been studied extensively, but have received limited attention in the context of time series. In this talk, we present a series of Bayesian models for capturing such sparse dependencies via clustering, graphical models, and low-dimensional embeddings of time series. We explore these methods in a variety of applications, including house price modeling and inferring networks in the brain. We then turn to observed interaction data, and briefly touch upon how to devise statistical network models that capture important network features like sparsity of edge connectivity. Within our Bayesian framework, a key insight is to move to a continuous-space representation of the graph, rather than the typical discrete adjacency matrix structure. We demonstrate our methods on a series of real-world networks with up to hundreds of thousands of nodes and millions of edges.

**Bio:** Emily Fox is currently the Amazon Professor of Machine Learning in the Statistics Department at the University of Washington. She received an S.B. in 2004 and Ph.D. in 2009 from the Department of Electrical Engineering and Computer Science at MIT. She has been awarded a Sloan Research Fellowship (2015), an ONR Young Investigator award (2015), an NSF CAREER award (2014), the Leonard J. Savage Thesis Award in Applied Methodology (2009), and the MIT EECS Jin-Au Kong Outstanding Doctoral Thesis Prize (2009). Her research interests are in large-scale Bayesian dynamic modeling and computations.

(an ML talk in the AMS Speaker Series)

Fri 04/08/16, 12:00pm, Hackerman Hall B17**Deep Learning and Linguistic Structure***Alexander “Sasha” Rush, Harvard University*

**Abstract:** Earlier this year, Chris Manning published an essay observing that “2015 was the year NLP felt the full force of the (deep learning) tsunami.” While expressing excitement for the success of these methods and the sudden burst of interest in NLP, he laments the turn away from the “domain science of language technology” and calls for research that further delves “into problems, approaches, and architectures.” In this talk I will present our group’s progress in developing novel approaches for capturing linguistic structure in deep learning-based models. In particular I will focus on end-to-end training of two unconstrained latent architectures designed to learn specific underlying structure: (1) a character-aware neural model that learns the morphological structure of words (Kim et al., 2016), and (2) a cluster-aware neural model that learns latent representations of discourse entities (Wiseman et al., 2016). While these are both “black-box” deep learning models, we can explore the internal representations to better understand what aspects of linguistic structure are being learned. I will present applications of these models to the tasks of language modeling, coreference resolution, machine translation, and grammar correction.

**Bio:** Alexander Rush is an Assistant Professor of Computer Science at Harvard University, and formerly a Post-doctorate Fellow at Facebook Artificial Intelligence Research (FAIR). He is interested in machine learning methods for natural language processing and understanding. His past work has introduced novel methods for large-scale structured prediction with applications to syntactic parsing and machine translation.

(an ML talk in the CLSP Speaker Series)

Thu 04/07/16, 01:30pm, Whitehead 304**Distributed proximal gradient methods for cooperative multi-agent consensus optimization***Serhat Aybat, Penn State University*

**Abstract:** In this talk, I will discuss decentralized methods for solving cooperative multi-agent consensus optimization problems. Consider an undirected network of agents, where only those agents connected by an edge can directly communicate with each other. The objective is to minimize the sum of agent-specific composite convex functions, i.e., each term in the sum is a private cost function belonging to an agent. In the first part, I will discuss the unconstrained case, and in the second part I will focus on the constrained case, where each agent has a private conic constraint set. For the constrained case the optimal consensus decision should lie in the intersection of these private sets. This optimization model abstracts a number of applications in machine learning, distributed control, and estimation using sensor networks. I will discuss different types of distributed algorithms; in particular, I will describe methods based on inexact augmented Lagrangian, and linearized ADMM. I will provide convergence rates both in sub-optimality error and consensus violation; and also examine the effect of underlying network topology on the convergence rates of the proposed decentralized algorithms. Joint work with Ph.D. students Zi Wang, Erfan Yazdandoost, and Shiqian Ma from Chinese University of Hong Kong, and Garud Iyengar from Columbia University.

(an ML talk in the AMS Speaker Series)

Mon 04/04/16, 12:15pm, School of Public Health, Room W3008**Inference of Low Dimensional Parameters with High-Dimensional Data***Cun-Hui Zhang, Rutgers University*

**Abstract:** We consider sample size requirements for statistical inference in a semi-low-dimensional approach to the analysis of high-dimensional data. The relationship between this semi-low-dimensional approach and regularized estimation of high-dimensional objects is parallel to the more familiar one between semiparametric analysis and nonparametric estimation. We discuss three equivalent forms of a low-dimensional projection estimator and two choices in approximating the direction of the least favorable submodel for the estimation of the low-dimensional parameter. We discuss regular efficiency cases where the sample size requirement is n ≫ (s log p)^2 for the approach to work, where p is the nominal dimension and s is a measure of the complexity of the model. Examples of regular efficiency include the estimation of a quadratic function of a high-dimensional vector and graphical model estimation. We also discuss super-efficiency cases where the sample size requirement can be as low as n ≫ s log p. Examples of super-efficiency include the estimation of a treatment effect in a randomized experiment and the estimation of regression coefficients with certain knowledge of the population Gram matrix. Ancillarity and benefits of unlabeled data will be discussed if time permits.

(an ML talk in the Biostatistics Speaker Series)

Thu 03/31/16, 01:30pm, Whitehead 304**Co-clustering of Nonsmooth Graphons***David Choi, CMU*

**Abstract:** Theoretical results are becoming known for community detection and clustering of networks; however, these results assume an idealized generative model that is unlikely to hold in many settings. Here we consider exploratory co-clustering of a bipartite network, where the rows and columns of the adjacency matrix are assumed to be samples from an arbitrary population. This is equivalent to assuming that the data is generated from a nonparametric model known as a graphon. We show that co-clusters found by any method can be extended to the row and column populations, or equivalently that the estimated blockmodel approximates a blocked version of the generative graphon, with generalization error bounded by n^{-1/2}. Analogous results are also shown for degree-corrected co-blockmodels and random dot product bipartite graphs, with error rates depending on the dimensionality of the latent variable space.

(an ML talk in the AMS Speaker Series)

Tue 03/29/16, 10:45am, Hackerman Hall B17**Graphical Models for Missing Data: Recoverability, Testability and Recent Surprises!***Karthika Mohan, UCLA*

**Abstract:** The bulk of literature on missing data employs procedures that are data-centric as opposed to process-centric and relies on a set of strong assumptions that are primarily untestable (e.g., Missing At Random, Rubin 1976). As a result, this area of research is wanting in tools to encode assumptions about the underlying data-generating process, methods to test these assumptions, and procedures to decide if queries of interest are estimable and, if so, to compute their estimands. We address these deficiencies by using a graphical representation called “Missingness Graph” which portrays the causal mechanisms responsible for missingness. Using this representation, we define the notion of recoverability, i.e., deciding whether there exists a consistent estimator for a given query. We identify graphical conditions for recovering joint and conditional distributions and present algorithms for detecting these conditions in the missingness graph. Our results apply to missing data problems in all three categories (MCAR, MAR and MNAR), the last of which is relatively unexplored. We further address the question of testability, i.e., whether an assumed model can be subjected to statistical tests, considering the missingness in the data. Furthermore, viewing the missing data problem from a causal perspective has ushered in several surprises. These include recoverability when variables are causes of their own missingness, testability of the MAR assumption, alternatives to iterative procedures such as the EM algorithm, and the indispensability of causal assumptions for large sets of missing data problems.

(an ML talk in the CS Speaker Series)

Wed 03/23/16, 12:00pm, Hackerman Hall B17**The Computational, Statistical and Practical Aspects of Machine Learning***Yaoliang Yu, Carnegie Mellon University*

**Abstract:** The big data revolution has profoundly changed, among many other things, how we perceive business, research, and application. However, in order to fully realize the potential of big data, certain computational and statistical challenges need to be addressed. In this talk, I will present my research in facilitating the deployment of machine learning methodologies and algorithms in big data applications. I will first present robust methods that are capable of accounting for uncertain or abnormal observations. Then I will present a generic regularization scheme that automatically extracts compact and informative representations from heterogeneous, multi-modal, multi-array, time-series, and structured data. Next, I will discuss two gradient algorithms that are computationally very efficient for our regularization scheme, and I will mention their theoretical convergence properties and computational requirements. Finally, I will present a distributed machine learning framework that allows us to process extremely large-scale datasets and models. I conclude my talk by sharing some future directions that I am and will be pursuing.

**Bio:** Yaoliang Yu is currently a research scientist affiliated with the center for machine learning and health, and the machine learning department of Carnegie Mellon University. He obtained his PhD (under Dale Schuurmans and Csaba Szepesvari) in computing science from University of Alberta (Canada, 2013), and he received the PhD Dissertation Award from the Canadian Artificial Intelligence Association in 2015.

(an ML talk in the CS Speaker Series)

Wed 03/23/16, 09:00am, Hackerman Hall 320**Using Motion to Understand Objects in the Real World***David Held, U.C. Berkeley*

**Abstract:** Many robots today are confined to operate in relatively simple, controlled environments. One reason for this is that current methods for processing visual data tend to break down when faced with occlusions, viewpoint changes, poor lighting, and other challenging but common situations that occur when robots are placed in the real world. I will show that we can train robots to handle these variations by modeling the causes behind visual appearance changes. If we model how the world changes over time, we can be robust to the types of changes that objects often undergo. I demonstrate this idea in the context of autonomous driving, and I show how we can use this idea to improve performance on three different tasks: velocity estimation, segmentation, and tracking with neural networks. By modeling the causes of appearance changes over time, we can make our methods more robust to a variety of challenging situations that commonly occur in the real world, thus enabling robots to come out of the factory and into our lives.

**Bio:** David Held is a Post-doctoral Researcher at U.C. Berkeley working with Pieter Abbeel. He recently completed his Ph.D. in Computer Science at Stanford, doing research at the intersection of robotics, computer vision, and machine learning. His Ph.D. was co-advised by Sebastian Thrun and Silvio Savarese. David has also interned at Google, working on the self-driving car project. Before Stanford, he worked as a software developer for a startup company and was a researcher at the Weizmann Institute, working on building a robotic octopus. He received a B.S. in Mechanical Engineering at MIT in 2005, an M.S. in Mechanical Engineering at MIT in 2007, and an M.S. in Computer Science at Stanford in 2012, for which he was awarded the Best Master’s Thesis Award from the Computer Science Department.

Thu 03/10/16, 01:30pm, Whitehead 304**A Bayesian Nonparametric Model for Comparative Effectiveness Research***Gary Rosner, JHU (School of Medicine)*

**Abstract:** Comparative effectiveness research entails combining relevant data from disparate sources. Information about the effect and effectiveness of a treatment strategy on a disease, such as cancer or heart disease, may come from randomized clinical trials (RCTs), early phase non-randomized studies, and hospital or payer databases. In general, RCTs are considered to provide the highest level of evidence for comparing treatments. Including historical information, however, could improve efficiency of the trial by reducing the current study’s required sample size and, possibly, by helping to put the ultimate results of the study in an appropriate context for comparing effectiveness of the new treatment to that of the standard of care. One needs to account for inherent differences between the studies, however, because heterogeneities may bias the inference. In this talk, I discuss some of the advantages the Bayesian approach to inference offers when carrying out comparative effectiveness research. In particular, I will present a flexible inferential model that uses Bayesian nonparametric methods to characterize prior uncertainty in the data and allows for borrowing strength when appropriate while also accommodating heterogeneities between data sources.

(an ML talk in the AMS Speaker Series)

Fri 03/04/16, 12:00pm, Hackerman B17**Towards Intelligent Audio Analysis and Understanding***John Hershey, MERL*

**Abstract:** We address the problem of acoustic source separation in a deep learning framework we call “deep clustering.” Deep learning has recently produced major improvements in speech enhancement tasks in which the speech and interference belong to distinct classes of signal. In this case, a deep network classifier labels time-frequency regions of the signal according to the class of the dominant source, and separation is achieved by reconstructing the corresponding regions. However, such classification-based approaches completely fail to learn in “cocktail party” scenarios, where the interference is also speech. We present an alternative method that generates relation-preserving embedding vectors, one for each time-frequency region of the spectrogram, such that their distances represent the graph structure of the desired solution. For speech separation, the graph defines the segmentation of the spectrogram into regions corresponding to each source, and its representation is decoded by clustering the embeddings. The embedded representation is thus flexible with respect to the number of clusters and is invariant to their permutations. This method can be compared to spectral clustering, which uses simple kernel features to represent high-rank affinities and decodes them using expensive spectral methods. Deep clustering instead uses powerful learned features to represent low-rank affinities that can be decoded using simple clustering methods. We present experiments showing speaker-independent separation of single channel speech mixtures that yields an astounding 10 dB average improvement in SNR to both speech signals after training on 30 hours of speech data. Even more surprisingly, the same model trained only on two-speaker mixtures can separate three-speaker mixtures, indicating an unusual degree of generalization. An audio demonstration of the results will be given and future directions will be discussed.

**Bio:** Prior to joining MERL in 2010, John spent 5 years at IBM’s T.J. Watson Research Center in New York, where he led a team in noise robust speech recognition. He also spent a year as a visiting researcher in the speech group at Microsoft Research, after obtaining his Ph.D. from UCSD in the area of multi-modal machine perception. He is currently working on machine learning for signal separation, speech recognition, language processing, and adaptive user interfaces.

(an ML talk in the CLSP Speaker Series)

Thu 03/03/16, 01:30pm, Whitehead 304**Mediation: From Intuition to Data Analysis***Ilya Shpitser, Johns Hopkins University*

**Abstract:** Modern causal inference links the “top-down” representation of causal intuitions and “bottom-up” data analysis with the aim of choosing policy. Two innovations that proved key for this synthesis were a formalization of Hume’s counterfactual account of causation using potential outcomes (due to Jerzy Neyman), and viewing cause-effect relationships via directed acyclic graphs (due to Sewall Wright). I will briefly review how a synthesis of these two ideas was instrumental in formally representing the notion of “causal effect” as a parameter in the language of potential outcomes, and discuss a complete identification theory linking these types of causal parameters and observed data, as well as approaches to estimation of the resulting statistical parameters. I will then describe, in more detail, how my collaborators and I are applying the same approach to mediation, the study of effects along particular causal pathways. I consider mediated effects at their most general: I allow arbitrary models, the presence of hidden variables, multiple outcomes, longitudinal treatments, and effects along arbitrary sets of causal pathways. As was the case with causal effects, there are three distinct but related problems to solve: a representation problem (what sort of potential outcome does an effect along a set of pathways correspond to), an identification problem (can a causal parameter of interest be expressed as a functional of observed data), and an estimation problem (what are good ways of estimating the resulting statistical parameter). I report a complete solution to the first two problems, and progress on the third. In particular, my collaborators and I show that for some parameters that arise in mediation settings, triply robust estimators exist, which rely on an outcome model, a mediator model, and a treatment model, and which remain consistent if any two of these three models are correct.
Some of the reported results are joint work with Eric Tchetgen, Caleb Miles, Phyllis Kanki, and Seema Meloni.

(an ML talk in the AMS Speaker Series)

Wed 03/02/16, 04:00pm, Gilman Hall, Room 50**Teaching Machines to See***Rene Vidal, Johns Hopkins University*

**Abstract:** In a lecture titled “Teaching Machines to See,” Professor René Vidal will describe his work developing mathematical methods that enable computers to see, analyze, and interpret images, videos, and biomedical data. Vidal directs the Vision Dynamics and Learning Lab, which is part of the Center for Imaging Science (CIS). The Don P. Giddens Inaugural Professorial Lecture series is named for the fifth dean of the Whiting School of Engineering and started in 1993 to honor newly promoted full professors.

Tue 02/23/16, 01:30pm, Whitehead 304**Structure-Enhancing Algorithms for Statistical Learning Problems***Paul Grigas, MIT*

**Abstract:** For many problems in statistical machine learning and data-driven decision-making, massive datasets necessitate the use of scalable algorithms that deliver sensible (interpretable) and statistically sound solutions. In this talk, we discuss several scalable algorithms that directly promote well-structured solutions in two related contexts: (i) sparse high-dimensional linear regression, and (ii) low-rank matrix completion, both of which are particularly relevant in modern machine learning. In the context of linear regression, we study several boosting algorithms – which directly promote sparse solutions – from the perspective of modern first-order methods in convex optimization. We use this perspective to derive the first-ever computational guarantees for existing boosting methods and to develop new algorithms with associated computational guarantees as well. In the context of matrix completion, we present an extension of the Frank-Wolfe method in convex optimization that is designed to induce near-optimal low-rank solutions for regularized matrix completion problems, and we derive computational guarantees that trade off between low-rank structure and data fidelity. For both problem contexts, we present computational results using datasets from microarray and recommender system applications.

(an ML talk in the AMS Speaker Series)

Tue 02/23/16, 12:00pm, Hackerman B17**Future (?) of Machine Translation***KyungHyun Cho, New York University*

**Abstract:** It is quite easy to believe that the recently proposed approach to machine translation, called neural machine translation, is simply yet another approach to statistical machine translation. This belief may drive research effort toward (incrementally) improving the existing neural machine translation system to outperform, or perform comparably to, the existing variants of phrase-based systems. In this talk, I aim to convince you otherwise. I argue that neural machine translation is not here to compete against the existing translation systems, but to open new opportunities in the field of machine translation. I will discuss three opportunities: (1) sub-word-level translation, (2) larger-context translation, and (3) multilingual translation.

**Bio:** Kyunghyun Cho is an assistant professor of Computer Science and Data Science at New York University (NYU). Previously, he was a postdoctoral researcher at the University of Montreal under the supervision of Prof. Yoshua Bengio after obtaining a doctorate degree at Aalto University (Finland) in early 2014. Kyunghyun’s main research interests include neural networks, generative models and their applications, especially to language understanding.

(an ML talk in the CLSP Speaker Series)

Thu 02/18/16, 01:30pm, Whitehead 304**Robust and Efficient Collocation Methods for Parameterized Models***Akil Narayan, University of Utah*

**Abstract:** Monte Carlo (MC) methods for the construction of polynomial approximations are effective tools for building a computational surrogate of the parametric variation for a model response. In this talk we investigate least-squares regularization of noisy data and compressive sampling recovery of sparse representations. We wish to minimize the number of samples required for a stable and accurate procedure. We propose an algorithm for a particular kind of weighted Monte Carlo approximation method based on sampling from the pluripotential equilibrium measure. Standard MC methods suffer from poor stability and accuracy for high-order approximations, but the properties of the equilibrium measure allow us to derive quasi-optimal statements of mathematical recoverability in both over- and undersampled regression problems. We also show that such an approach typically yields very stable, high-order computational algorithms for parameterized PDE approximation. We present theoretical analysis to motivate the algorithm, and numerical results to illustrate that equilibrium measure-based approaches are superior to standard MC methods in many situations of interest, notably in high-dimensional scenarios.

(an ML talk in the AMS Speaker Series)

Thu 02/18/16, 12:15pm, Room @5030, School of Public Health**Novel Statistical Frameworks for Analysis of Structured Sequential Data***Abhra Sarkar, Duke University*

**Abstract:** We are developing a broad array of novel statistical frameworks for analyzing complex sequential data sets. Our research is primarily motivated by a collaboration with neuroscientists trying to understand the neurological, genetic and evolutionary basis of human communication using bird and mouse models. The data sets comprise structured sequences of syllables or ‘songs’ produced by animals from different genotypes under different experimental conditions. Simple first order Markov chains are insufficiently flexible to learn complex serial dependency structures and systematic patterns in the vocalizations, an important goal in these studies. To this end, we have developed a sophisticated nonparametric Bayesian approach to higher order Markov chains building on probabilistic tensor factorization techniques. Our proposed method is of very broad utility, with applications not limited to analysis of animal vocalizations, and provides new insights into the serial dependency structures of many previously analyzed sequential data sets arising from diverse application areas. Our method has appealing theoretical properties and practical advantages, and achieves substantial gains in performance compared to previously existing methods. Our research also paves the way to advanced automated methods for more sophisticated dynamical systems, including higher order hidden Markov models that can accommodate more general data types.

(an ML talk in the Biostatistics Speaker Series)

Tue 02/16/16, 01:30pm, Clark 314**Universality Laws for Randomized Dimension Reduction***Joel A. Tropp, California Institute of Technology*

**Abstract:** Dimension reduction is the process of embedding high-dimensional data into a lower dimensional space to facilitate its analysis. In the Euclidean setting, one fundamental technique for dimension reduction is to apply a random linear map to the data. The question is how large the embedding dimension must be to ensure that randomized dimension reduction succeeds with high probability. This talk describes a phase transition in the behavior of the dimension reduction map as the embedding dimension increases. The location of this phase transition is universal for a large class of datasets and random dimension reduction maps. Furthermore, the stability properties of randomized dimension reduction are also universal. These results have many applications in numerical analysis, signal processing, and statistics. This is joint work with Samet Oymak.
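As a small illustration of the "random linear map" mentioned in the abstract, the sketch below embeds a few points with a Gaussian random matrix and compares one pairwise distance before and after. The Gaussian matrix, the dimensions (200 down to 50), the seeds, and the function names are assumptions chosen for illustration; the talk covers a much larger class of maps and does not prescribe this construction.

```python
import math
import random


def random_projection(points, target_dim, seed=0):
    """Apply one Gaussian random linear map to every point.

    Entries are N(0, 1) scaled by 1/sqrt(target_dim), so squared lengths
    are preserved in expectation.
    """
    rng = random.Random(seed)
    source_dim = len(points[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(source_dim)]
              for _ in range(target_dim)]
    return [[sum(w * x for w, x in zip(row, point)) for row in matrix]
            for point in points]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Five synthetic points in 200 dimensions, embedded into 50 dimensions.
data_rng = random.Random(1)
points = [[data_rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(5)]
embedded = random_projection(points, target_dim=50)

# Ratio of embedded distance to original distance for one pair of points.
ratio = distance(embedded[0], embedded[1]) / distance(points[0], points[1])
print(f"distance ratio after embedding: {ratio:.3f}")
```

Shrinking `target_dim` makes the ratio drift further from 1; how small the embedding dimension can be before the embedding fails is the question the talk addresses.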

**Bio:** Joel A. Tropp is Professor of Applied & Computational Mathematics at the California Institute of Technology. He earned the Ph.D. degree in Computational Applied Mathematics from the University of Texas at Austin in 2004. His research centers on signal processing, numerical analysis, and random matrix theory. Prof. Tropp won the 2008 Presidential Early Career Award for Scientists and Engineers. He received society best paper awards from SIAM in 2010, EUSIPCO in 2011, and IMA in 2015. He was also recognized as a Thomson Reuters Highly Cited Researcher in Computer Science in 2014 and 2015.

(an ML talk in the CIS Speaker Series)

Tue 02/16/16, 12:00pm, Hackerman B17**Multimodal Question Answering for Language and Vision***Richard Socher, MetaMind*

**Abstract:** Deep learning enabled tremendous breakthroughs in visual understanding and speech recognition. Ostensibly, this is not the case in natural language processing (NLP) and higher level reasoning. However, it only appears that way because there are so many different tasks in NLP and no single one of them, by itself, captures the complexity of language understanding. In this talk, I introduce dynamic memory networks, which are our attempt to solve a large variety of NLP and vision problems through the lens of question answering.

**Bio:** Richard Socher is the CEO and founder of MetaMind, a startup that seeks to improve artificial intelligence and make it widely accessible. He obtained his PhD from Stanford working on deep learning with Chris Manning and Andrew Ng and won the best Stanford CS PhD thesis award. He is interested in developing new AI models that perform well across multiple different tasks in natural language processing and computer vision. He was awarded the Distinguished Application Paper Award at the International Conference on Machine Learning (ICML) 2011, the 2011 Yahoo! Key Scientific Challenges Award, a Microsoft Research PhD Fellowship in 2012, a 2013 “Magic Grant” from the Brown Institute for Media Innovation, and the 2014 GigaOM Structure Award.

(an ML talk in the CLSP Speaker Series)

Fri 02/12/16, 12:15pm, Room W3008 Bloomberg School of Public Health**Nearest Neighbor Gaussian Process Models for Massive Spatial and Spatio-temporal Data***Abhirup Datta, Division of Biostatistics, University of Minnesota*

**Abstract:** Gaussian process (GP) models are widely used for analyzing space and space-time indexed data from forestry, environmental health, climate sciences, etc. However, traditional GP models entail computations that become prohibitive for modern geostatistical datasets with a large number of spatial or temporal locations. In this talk, I will present our proposed Nearest-neighbor Gaussian process (NNGP) models, which provide a highly scalable alternative for fully model-based inference for massive spatial and spatio-temporal datasets. NNGP is a well-defined spatial process and can be used as a sparsity-inducing prior for spatial or spatio-temporal random effects within a rich hierarchical modeling framework. Matrix-free Markov chain Monte Carlo (MCMC) algorithms for NNGP deliver massive scalability. NNGP effectively reproduces the corresponding inference from traditional (but highly expensive) GP models. I will also discuss applications of NNGP to massive scale prediction of forest biomass and analysis of air pollution data.

(an ML talk in the Biostatistics Speaker Series)

Tue 02/09/16, 12:00pm, Clark 314**Where to Buy It: Matching Street Clothing Photos in Online Shops and Visual Madlibs: Fill in the Blank Image Generation and Question Answering***Alex Berg, University of North Carolina at Chapel Hill*

**Abstract:** Where to Buy It – We define a new task, Exact Street to Shop, where our goal is to match a real-world example of a garment item to the same item in an online shop. We develop three different methods for Exact Street to Shop retrieval, including two deep learning baseline methods, and a method to learn a similarity measure between the street and shop domains. Experiments demonstrate that our learned similarity significantly outperforms our baselines that use existing deep learning based representations. Visual Madlibs – We introduce a new dataset consisting of 360,001 focused natural language descriptions for 10,738 images. This dataset, the Visual Madlibs dataset, is collected using automatically produced fill-in-the-blank templates designed to gather targeted descriptions about: people and objects, their appearances, activities, and interactions, as well as inferences about the general scene or its broader context. We provide several analyses of the Visual Madlibs dataset and demonstrate its applicability to two new description generation tasks: focused description generation, and multiple-choice question answering for images. Experiments using joint-embedding and deep learning methods show promising results on these tasks.

**Bio:** Alex Berg’s research concerns computational visual recognition. He has worked on general object recognition in images, action recognition in video, human pose identification in images, image parsing, face recognition, image search, and machine learning for computer and human vision. He co-organizes the ImageNet Large Scale Visual Recognition Challenge, and organized the first Large-Scale Learning for Vision workshop. He is currently an assistant professor in computer science at UNC Chapel Hill. Prior to that he was on the faculty at Stony Brook University, a research scientist at Columbia University, and a research scientist at Yahoo! Research. His PhD at U.C. Berkeley developed a novel approach to deformable template matching. He earned a BA and MA in Mathematics from Johns Hopkins University and learned to race sailboats at SSA in Annapolis. In 2013, his work received the Marr prize.

(an ML talk in the CIS Speaker Series)

Tue 02/02/16, 12:00pm, Hackerman B17**Binary and Multiclass Calibration in Speaker and Language Recognition***Niko Brummer, AGNITIO*

**Abstract:** Automatic pattern classifiers that output soft, probabilistic classifications—rather than hard decisions—can be more widely and more profitably applied, provided the probabilistic output is well-calibrated. In the fields of automatic speaker recognition and automatic spoken language recognition, the regular NIST technology evaluations have placed a strong emphasis on cost effective application and therefore on calibration. This talk will describe calibration solutions for these technologies, with emphasis on criteria for measuring the goodness of calibration—if we can measure it, we can also optimize it. The core of the talk is a derivation and a re-interpretation of cross-entropy, which is the standard objective function in machine learning for the supervised training of classifiers. The main theoretical result is that cross-entropy represents the expected cost of making minimum-expected-cost Bayes decisions, based on the outputs of a softmax classifier. For this equivalence we use a special misclassification cost function, defined over a smooth range of cost values. In practice this means that classifiers trained with cross-entropy can be expected to work well over a wide range of different applications.
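
The link between cross-entropy and decision cost described above can be checked numerically in the binary case. The sketch below is an illustration under stated assumptions, not necessarily the derivation given in the talk: it uses a standard threshold-cost integral representation of the logarithmic loss, in which a Bayes decision is made at every threshold `c` in (0, 1), a miss costs `1 - c`, a false alarm costs `c`, and thresholds are weighted by `1 / (c (1 - c))`. The function name `cost_weighted_loss` and the grid size are hypothetical choices.

```python
import math
import numpy as np

def cost_weighted_loss(q, y, n=200_000):
    """Integrate the cost of threshold decisions over all thresholds c in (0, 1).

    q is the predicted probability of class 1 and y is the true label (0 or 1).
    At threshold c the decision is "class 1" when q > c.
    Assumed costs: a miss costs (1 - c), a false alarm costs c.
    Assumed weighting over thresholds: 1 / (c * (1 - c)).
    """
    c = (np.arange(n) + 0.5) / n              # midpoint grid on (0, 1)
    if y == 1:
        cost = np.where(q <= c, 1.0 - c, 0.0)  # class 1 rejected: miss
    else:
        cost = np.where(q > c, c, 0.0)         # class 1 accepted: false alarm
    weight = 1.0 / (c * (1.0 - c))
    return float(np.sum(cost * weight) / n)    # midpoint-rule integral

q = 0.3
print(cost_weighted_loss(q, 1), -math.log(q))      # both approximate -log(q)
print(cost_weighted_loss(q, 0), -math.log(1 - q))  # both approximate -log(1 - q)
```

Under these assumptions the integrated decision cost matches the per-example cross-entropy term, which is one way to see why a classifier trained with cross-entropy is not tied to a single operating point.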

**Bio:** Niko Brummer received B. Eng (1986), M. Eng (1988) and Ph.D. (2010) degrees, all in electronic engineering, from Stellenbosch University. He worked as a researcher at DataFusion (later called Spescom DataVoice) and is currently chief scientist at AGNITIO. Most of his research for the last two decades has been applied to automatic speaker and language recognition and he has been participating in most of the NIST SRE and LRE evaluations in these technologies, from the year 2000 to the present. He has been contributing to the Odyssey Workshop series since 2001 and was organizer of Odyssey 2008 in Stellenbosch. His FoCal Toolkit is widely used for fusion and calibration in speaker and language recognition research. His research interests include development of new algorithms for speaker and language recognition, as well as evaluation methodologies for these technologies. In both cases, his emphasis is on probabilistic modelling. He has worked with both generative (eigenchannel, JFA, i-vector PLDA) and discriminative (system fusion, discriminative JFA and PLDA) recognizers. In evaluation, his focus is on judging the goodness of classifiers that produce probabilistic outputs in the form of well calibrated class likelihoods.

(an ML talk in the CLSP Speaker Series)

Mon 02/01/16, 12:15pm, Room W3008, School of Public Health**Robust Causal Inference with Continuous Exposures***Edward Kennedy, University of Pennsylvania, Perelman School of Medicine*

**Abstract:** Continuous treatments (e.g., doses) arise often in practice, but standard causal effect estimators are limited: they either employ parametric models for the effect curve, or else do not allow for doubly robust covariate adjustment. Double robustness allows one of two nuisance estimators to be misspecified, and is important for protecting against model misspecification as well as reducing sensitivity to the curse of dimensionality. In this work we present a novel approach for causal dose-response curve estimation that is doubly robust without requiring any parametric assumptions, and which naturally incorporates general off-the-shelf machine learning. We derive asymptotic properties for a kernel-based version of our approach and propose a method for data-driven bandwidth selection. The methods are illustrated via simulation and in a study of the effect of hospital nurse staffing on excess readmissions penalties.

(an ML talk in the Biostatistics Speaker Series)

Fri 01/29/16, 12:15pm, Room W3008, School of Public Health**Precision Medicine, Learning Health Systems, and Improving Surveillance of Low Risk Prostate Cancer***Yates Coley, Johns Hopkins Bloomberg School of Public Health*

**Abstract:** We present a project from the Johns Hopkins Individualized Health Initiative to support a personalized prostate cancer management program. For individuals with a diagnosis of low risk prostate cancer, active surveillance offers an alternative to early curative intervention. The success of surveillance depends on being able to effectively distinguish indolent tumors from those with metastatic potential, a characteristic that cannot be directly observed without surgical removal of the prostate. We have developed a Bayesian hierarchical model for prediction of an individual’s latent cancer state by integrating multiple sources of data collected in the practice of active surveillance. Existing modeling approaches are extended to accommodate measurement error in cancer state determinations based on biopsied tissue and to allow observations to possibly be missing not at random. Predictions can be updated in real time with an importance sampling algorithm and communicated with patients and clinicians through a decision support tool. Integration of the model into the clinical workflow will automate model estimation and enable a continuously learning prediction model.

(an ML talk in the Biostatistics Speaker Series)

Thu 01/28/16, 01:30pm, Whitehead 304**Feature Allocations, Probability Functions and Paintboxes***Tamara Broderick, MIT*

**Abstract:** Clustering involves placing entities into mutually exclusive categories. We wish to relax the requirement of mutual exclusivity, allowing objects to belong simultaneously to multiple classes, a formulation that we refer to as “feature allocation.” The first step is a theoretical one. In the case of clustering, the class of probability distributions over exchangeable partitions of a dataset has been characterized (via exchangeable partition probability functions and the Kingman paintbox). These characterizations support an elegant nonparametric Bayesian framework for clustering in which the number of clusters is not assumed to be known a priori. We establish an analogous characterization for feature allocation; we define notions of “exchangeable feature probability functions” and “feature paintboxes” that lead to a Bayesian framework that does not require the number of features to be fixed a priori. The second step is a computational one. Rather than appealing to Markov chain Monte Carlo for Bayesian inference, we develop a method to transform Bayesian methods for feature allocation (and other latent structure problems) into optimization problems with objective functions analogous to K-means in the clustering setting. These yield approximations to Bayesian inference that are scalable to large inference problems.

**Bio:** Tamara is the ITT career development Assistant Professor in the Electrical Engineering and Computer Science (EECS) department at MIT. She is also a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL), the Institute for Data, Systems, and Society (IDSS), and MachineLearning@MIT. Before moving to MIT, she completed her PhD at UC Berkeley with Michael I. Jordan. She works in the areas of machine learning and statistics, particularly in Bayesian inference and graphical models with an emphasis on scalable, nonparametric and unsupervised learning. You can find more about her research here: http://www.tamarabroderick.com/

(an ML talk in the AMS Speaker Series)

Tue 01/26/16, 12:00pm, Hackerman B17**Interactive Training of Relation Embeddings Using High-Level Supervision***Sameer Singh, University of Washington*

**Abstract:** An important challenge in extracting useful structured information from text collections is relation extraction, i.e. identifying the types of relations between entities that are expressed in text. Due to the variety in how relations are rendered in language, labeling data for relation extraction is notoriously time-consuming and expensive. Recently proposed embedding-based extractors that utilize unlabeled data and use noisy KB alignments as “distant labels” partially address this concern. However, not only are these models inaccurate for relations that do not have a large KB, but they also cannot be improved without annotating data. Purely rule-based systems, on the other hand, provide an attractive alternative as they allow users to directly inject symbolic domain knowledge, but they require a large number of formulae to achieve reasonable generalization. In this talk, I introduce an interactive training paradigm that combines embedding-based models of relation extraction with symbolic domain knowledge. I first describe how symbolic domain knowledge, if provided by the user as first-order logic statements, can be injected into the embeddings to improve the predictions. In the second part of the talk, I present an approach to “explain” the embedding-based model predictions using a symbolic representation, which the user can annotate directly for more effective supervision. I present experiments that demonstrate the potential of symbolic knowledge as supervision in reducing annotation effort and in quickly training accurate relation extraction systems. This work is a collaboration with Tim Rocktaschel, Sebastian Riedel, Luke Zettlemoyer, and Carlos Guestrin.

**Bio:** Sameer Singh is a Postdoctoral Research Associate at the University of Washington, working with Carlos Guestrin, Luke Zettlemoyer, and Dan Weld on large-scale and interactive machine learning applied to information extraction and natural language processing. He received his PhD from the University of Massachusetts, Amherst in 2014, where he worked with Andrew McCallum on scalable inference for large graphical models and probabilistic programming. He was recently selected as a DARPA Riser, won the grand prize in the Yelp dataset challenge in 2015, has been awarded the Yahoo! KSC fellowship and the UMass Graduate School fellowship, and was a finalist for the Facebook PhD fellowship. Sameer’s internships at Microsoft Research, Google Research, and Yahoo! Labs involved designing machine learning algorithms for massive datasets. He is one of the founding organizers of the popular NIPS Big Learning and ICML Inferning workshops, and has been organizing the Automated Knowledge-Base Construction (AKBC) workshops in 2013, 2014, and 2016.

(an ML talk in the CLSP Speaker Series)

Fri 01/15/16, 12:15pm, Room W4030, School of Public Health**Statistical Methods for Inference in High Dimensional “Omics” Data***Ni Zhao, Fred Hutchinson Cancer Research Center*

**Abstract:** Recent advances in high-throughput biotechnology have enabled multiple platform “omics” profiling of biological samples. In this talk, I will present two statistical inference methods for testing the association between high-dimensional “omics” data and a phenotype of interest. In the first method, we utilized the kernel machine regression framework to analyze large-scale microbiome composition data [Zhao et al, American Journal of Human Genetics, 2015]. We constructed kernels that can incorporate the phylogenetic information in the microbiome data. We further developed a robust omnibus test that can incorporate multiple kernels. In the second method, we developed a powerful likelihood ratio test via the composite kernel machine regression in Genome Wide Association Studies [Zhao et al, Biometrics, Prepare for Submission]. The method tests the association between multiple SNPs and the phenotype, considering possible gene-environment interaction. I illustrate the utility of both methods using simulations and with application to data from real studies.

(an ML talk in the Biostatistics Speaker Series)

Mon 01/11/16, 12:15pm, Room W2008, School of Public Health**High-Dimensional Matrix Linear Regression Model***Dehan Kong, University of North Carolina, Chapel Hill*

**Abstract:** We develop a high-dimensional matrix linear regression model (HMLRM) to correlate matrix responses with high-dimensional scalar covariates when coefficient matrices have low-rank structures. We propose a fast and efficient screening procedure based on the spectral norm to deal with the case that the dimension of scalar covariates is ultra-high. We develop an efficient estimation procedure based on the nuclear norm regularization, which explicitly borrows the matrix structure of coefficient matrices. We systematically investigate various theoretical properties of our estimators, including estimation consistency, rank consistency, and the sure independence screening property under HMLRM. We examine the finite-sample performance of our methods using simulations and a large-scale imaging genetic dataset collected by the Alzheimer’s Disease Neuroimaging Initiative study.

(an ML talk in the Biostatistics Speaker Series)

Tue 01/05/16, 12:15pm, Room W4030, School of Public Health**Robust Bayesian Inference via Coarsening***Jeffrey Miller, Duke University*

**Abstract:** The standard approach to Bayesian inference is based on the assumption that the distribution of the data belongs to the chosen model class. However, even a small violation of this assumption can have a large impact on the outcome of a Bayesian procedure, particularly when the data set is large. We introduce a simple, coherent approach to Bayesian inference that improves robustness to small departures from the model: rather than condition on the observed data exactly, one conditions on the event that the model generates data close to the observed data, with respect to a given statistical distance. When closeness is defined in terms of relative entropy, the resulting “coarsened posterior” can be approximated by simply raising the likelihood to a certain fractional power, making the method computationally efficient and easy to implement in practice. We illustrate with real and simulated data, and provide theoretical results.
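
As a rough illustration of the "raise the likelihood to a fractional power" approximation mentioned above, the sketch below applies it to a conjugate Beta-Bernoulli model, where the tempered posterior stays in closed form. The model choice, the function names, and the value of the power `zeta` are assumptions made for this example; the abstract does not specify how the power is chosen.

```python
def power_posterior_beta(successes, failures, zeta, a=1.0, b=1.0):
    """Beta posterior for a Bernoulli rate with the likelihood raised to zeta.

    With a Beta(a, b) prior, likelihood**zeta just scales the observed counts,
    so the result is Beta(a + zeta * successes, b + zeta * failures).
    zeta = 1 recovers the standard posterior; 0 < zeta < 1 (assumed here)
    tempers the data and widens the posterior.
    """
    return a + zeta * successes, b + zeta * failures

def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1.0))

standard = power_posterior_beta(700, 300, zeta=1.0)
coarsened = power_posterior_beta(700, 300, zeta=0.1)  # zeta is illustrative only
print(standard, beta_variance(*standard))
print(coarsened, beta_variance(*coarsened))
```

In this toy setting the tempered posterior keeps roughly the same mean but has a larger variance, which reflects the intended effect: the posterior is less sensitive to small departures of the data from the assumed model.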

(an ML talk in the Biostatistics Speaker Series)

Thu 12/17/15, 01:15pm, Malone Hall 328**A Unified Framework for Large-Scale Block-Structured Optimization***Mingyi Hong, Iowa State University*

**Abstract:** In this talk we present a powerful algorithmic framework for large-scale optimization, called Block Successive Upper bound Minimization (BSUM). The BSUM includes as special cases many well-known methods for signal processing, communication or massive data analysis, such as Block Coordinate Descent (BCD), Convex-Concave Procedure (CCCP), Block Coordinate Proximal Gradient (BCPG) method, Nonnegative Matrix Factorization (NMF), Expectation Maximization (EM) method and so on. In this talk, various features and properties of the BSUM are discussed from the viewpoint of design flexibility, computational efficiency and parallel/distributed implementation. Illustrative examples from networking, signal processing and machine learning are presented to demonstrate the practical performance of the BSUM framework.

**Bio:** Mingyi Hong received his B.E. degree in Communications Engineering from Zhejiang University, China, in 2005, his M.S. degree in Electrical Engineering from Stony Brook University in 2007, and Ph.D. degree in Systems Engineering from University of Virginia in 2011. From 2011 to 2014 he held research positions at the Department of Electrical and Computer Engineering, University of Minnesota. He is currently a Black & Veatch Faculty Fellow and an Assistant Professor with the Department of Industrial and Manufacturing Systems Engineering, Iowa State University. His research interests are primarily in the fields of large-scale optimization theory, and its application in statistical signal processing, next generation wireless networking, and big data related problems.

(an ML talk in the CLSP Speaker Series)

Wed 12/09/15, 12:15pm, Bloomberg School of Public Health, Room W5030**Fast Bayesian Factor Analysis via Automatic Rotations to Sparsity***Veronika Rockova, The Wharton School, University of Pennsylvania*

**Abstract:** Rotational post-hoc transformations have traditionally played a key role in enhancing the interpretability of factor analysis. Regularization methods also serve to achieve this goal by prioritizing sparse loading matrices. In this work, we bridge these two paradigms with a unifying Bayesian framework. Our approach deploys intermediate factor rotations throughout the learning process, greatly enhancing the effectiveness of sparsity inducing priors. These automatic rotations to sparsity are embedded within a PXL-EM algorithm, a Bayesian variant of parameter-expanded EM for posterior mode detection. By iterating between soft-thresholding of small factor loadings and transformations of the factor basis, we obtain (a) dramatic accelerations, (b) robustness against poor initializations and (c) better oriented sparse solutions. To avoid the pre-specification of the factor cardinality, we extend the loading matrix to have infinitely many columns with the Indian Buffet Process (IBP) prior. The factor dimensionality is learned from the posterior, which is shown to concentrate on sparse matrices. Our deployment of PXL-EM performs a dynamic posterior exploration, outputting a solution path indexed by a sequence of spike-and-slab priors. For accurate recovery of the factor loadings, we deploy the Spike-and-Slab LASSO prior, a two-component refinement of the Laplace prior (Rockova 2015). A companion criterion, motivated as an integral lower bound, is provided to effectively select the best recovery. The potential of the proposed procedure is demonstrated on both simulated and real high-dimensional gene expression data, which would render posterior simulation impractical.
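
The soft-thresholding step mentioned above can be sketched on its own. This is only the generic soft-thresholding operator applied to a toy loading matrix, with a hypothetical threshold `lam`; it is not the PXL-EM algorithm, and the rotation step and the Spike-and-Slab LASSO prior are not modeled here.

```python
import numpy as np

def soft_threshold(x, lam):
    """Shrink each entry toward zero by lam and set entries with |x| <= lam to zero."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Toy factor loading matrix (values are illustrative only).
loadings = np.array([[2.0, -0.3],
                     [0.1,  1.5],
                     [-0.8, 0.05]])
print(soft_threshold(loadings, lam=0.5))
```

Small loadings are zeroed out while large ones are shrunk by a constant amount, which is how this operator produces sparse loading matrices.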

(an ML talk in the Biostatistics Speaker Series)

Tue 12/08/15, 10:00am, Charles Commons, Barber Conference Room**(Data-driven) Strategies to Predict Intra- and Inter- Cellular Signaling Dynamics and Function***Neda Bagheri, Northwestern University*

**Abstract:** In the past decade, emerging technologies have offered increasingly high throughput data with greater resolution to investigate cellular responses. To gain insight from dynamic gene expression, transcription factor activity, phospho-signaling or other data, improved computational strategies to analyze, integrate, and predict complex biological function must be developed. We employ a variety of inference and modeling algorithms to investigate the temporal and multifunctional evolution of various cellular responses. By developing predictive models that are informed by new experimental tools, we aim to resolve regulatory pathways responsible for complex biological response and cell fate decisions. In this manner, we can generate informed hypotheses on the mechanism of action of potential drug candidates and gain insight for improved efficacy/specificity of treatment strategies, providing a unique opportunity to predict and modulate biological responses.

**Bio:** Neda Bagheri earned a doctorate in Electrical Engineering from the University of California Santa Barbara. Her interest in computational and systems biology was piqued by the fact that, not surprisingly, many of the principles (i.e., regulatory motifs) commonly employed in control theory and dynamical systems are intrinsic to biology. As a result, she pursued a postdoc in Biological Engineering at MIT, where she worked closely with experimentalists and a clinical oncologist to investigate cancer and immune cell signaling. Now, as a member of the faculty in Chemical & Biological Engineering at Northwestern University, Neda continues her mission to integrate experimental data with novel computational strategies to elucidate complex intra-cellular dynamics and inter-cellular regulation. Neda Bagheri’s long-term goal is to resolve signal and information processing in complex regulatory networks, and identify control policies to effectively modulate biological function.

(an ML talk in the Biostatistics Speaker Series)

Tue 12/01/15, 01:30pm, Clark 314**A Segmentation Algorithm for Efficient Neural Reconstruction from Electron Microscopy Data***Toufiq Parag, Janelia Farm Research Campus*

**Abstract:** Recent efforts in neural reconstruction from Electron Microscopy (EM) images have revealed fundamental knowledge about the structure and function of the animal brain. Automatic segmentation algorithms play a pivotal role in the success of many of these studies. In this talk, I will discuss what characteristics of segmentation algorithms expedite the overall reconstruction process and present a method for achieving these qualities. The proposed algorithm relies on an active learning strategy to train the necessary classifiers from a small set of groundtruth examples. Our technique emphasizes minimizing under-segmentation errors in order to accelerate the error correction tasks. Although developed primarily for EM segmentation, elements of the proposed algorithm can be utilized in problems from other biological domains as well.

(an ML talk in the CIS Speaker Series)

Tue 11/24/15, 10:45am, Hackerman B17**Data Centers, Energy and Online Optimization***Adam Wierman, California Institute of Technology*

**Abstract:** This talk will tell two stories, one about designing sustainable data centers and one about the underlying algorithmic challenges, which fall into the context of online convex optimization. Story 1: The typical message surrounding data centers and energy is an extremely negative one: Data centers are energy hogs. This message is pervasive in both the popular press and academia, and it certainly rings true. However, the view of data centers as energy hogs is too simplistic. One goal of this talk is to highlight that, yes, data centers use a lot of energy, but data centers can also be a huge benefit in terms of integrating renewable energy into the grid and thus play a crucial role in improving the sustainability of our energy landscape. In particular, I will highlight a powerful alternative view: data centers as demand response opportunities. Story 2: Typically in online convex optimization it is enough to exhibit an algorithm with low (sub-linear) regret, which implies that the algorithm can match the performance of the best static solution in retrospect. However, what if one additionally wants to maintain performance that is nearly as good as the dynamic optimal, i.e., a good competitive ratio? In this talk, I’ll highlight that it is impossible for an online algorithm to simultaneously achieve these goals. Luckily though, in practical settings (like data centers), noisy predictions about the future are often available, and I will show that, under a general model of prediction noise, even very limited predictions about the future are enough to overcome the impossibility result.

**Bio:** Adam Wierman is a Professor in the Department of Computing and Mathematical Sciences at the California Institute of Technology, where he is a founding member of the Rigorous Systems Research Group (RSRG) and maintains a popular blog called Rigor + Relevance. His research interests center around resource allocation and scheduling decisions in computer systems and services. He received the 2011 ACM SIGMETRICS Rising Star award and the 2014 IEEE Communications Society William R. Bennett Prize, and has been a coauthor on papers that received best paper awards at ACM SIGMETRICS, IEEE INFOCOM, IFIP Performance (twice), IEEE Green Computing Conference, IEEE Power & Energy Society General Meeting, and ACM GREENMETRICS.

(an ML talk in the CIS Speaker Series)

Fri 11/20/15, 12:00pm, Hackerman B17**Linear Methods for Linguistics***Dean Foster, Amazon.com*

**Abstract:** Using random matrix theory, we now have some very easy to understand and fast to use methods of computing low rank representations of matrices. I have been using these methods as a hammer to improve several statistical methods: stepwise regression, CCAs, and HMM. I’ll discuss a few of these in this talk and how they can be connected to problems in NLP.

**Bio:** Much of Dean’s current work is on statistical approaches to NLP problems and other issues in big data. He has come up with several algorithms for fast variable selection in regressions and has proven these to have nice theoretical properties. He has used vector models for words to allow them to be more easily manipulated using statistical technology. These often end up using spectral techniques, for example, as he used them to fit HMMs and probabilistic CFG.

(an ML talk in the CLSP Speaker Series)

Tue 11/17/15, 01:30pm, Clark 314**Robust and Scalable Approach to Bayesian Inference***Stanislav Minsker, University of Southern California*

**Abstract:** Contemporary data analysis problems pose several general challenges. One is related to resource limitations: massive data require computer clusters for storage and processing. Another problem occurs when available observations are contaminated by “outliers” that are not easily identified and removed. An attempt to address these problems raises natural questions: can we design estimation techniques that (i) admit strong theoretical performance guarantees even when data contains “corrupted” observations; (ii) are scalable and can be implemented in parallel while preserving the quality of estimation? As a step towards answering these questions, we will explain how to compute medians in the space of probability measures, and introduce an alternative to exact Bayesian inference based on the “M-posterior” distribution. We will describe theoretical properties of the M-posterior, and illustrate our approach via several examples. The talk will be self-contained and aimed at a broad audience; it is based on joint work with D. Dunson, L. Lin and S. Srivastava.

**Bio:** Stanislav Minsker is an Assistant Professor in the Department of Mathematics at the University of Southern California, which he joined in 2015. Prior to joining USC, he worked at Duke University and at Wells Fargo Securities. His research interests lie in the areas of High-Dimensional Statistics, Statistical Learning theory and concentration-of-measure inequalities.

(an ML talk in the CIS Speaker Series)

Thu 11/12/15, 01:30pm, Whitehead 304**Scaling and Generalizing Variational Inference***David Blei, Columbia University*

**Abstract:** Latent variable models have become a key tool for the modern statistician, letting us express complex assumptions about the hidden structures that underlie our data. Latent variable models have been successfully applied in numerous fields. The central computational problem in latent variable modeling is posterior inference, the problem of approximating the conditional distribution of the latent variables given the observations. Posterior inference is central to both exploratory tasks and predictive tasks. Approximate posterior inference algorithms have revolutionized Bayesian statistics, revealing its potential as a usable and general-purpose language for data analysis. Bayesian statistics, however, has not yet reached this potential. First, statisticians and scientists regularly encounter massive data sets, but existing approximate inference algorithms do not scale well. Second, most approximate inference algorithms are not generic; each must be adapted to the specific model at hand. In this talk I will discuss our recent research on addressing these two limitations. I will describe stochastic variational inference, an approximate inference algorithm for handling massive data sets. I will demonstrate its application to probabilistic topic models of text conditioned on millions of articles. Then I will discuss black box variational inference. Black box inference is a generic algorithm for approximating the posterior. We can easily apply it to many models with little model-specific derivation and few restrictions on their properties. I will demonstrate its use on a suite of nonconjugate models of longitudinal healthcare data. This is joint work based on these two papers: M. Hoffman, D. Blei, J. Paisley, and C. Wang. Stochastic variational inference. Journal of Machine Learning Research, 14:1303-1347. R. Ranganath, S. Gerrish, and D. Blei. Black box variational inference. Artificial Intelligence and Statistics, 2014.

**Bio:** David Blei is a Professor of Statistics and Computer Science at Columbia University, and a member of the Columbia Data Science Institute. His research is in statistical machine learning, involving probabilistic topic models, Bayesian nonparametric methods, and approximate posterior inference algorithms for massive data. He works on a variety of applications, including text, images, music, social networks, user behavior, and scientific data. David has received several awards for his research, including a Sloan Fellowship (2010), Office of Naval Research Young Investigator Award (2011), Presidential Early Career Award for Scientists and Engineers (2011), Blavatnik Faculty Award (2013), and ACM-Infosys Foundation Award (2013).

(an ML talk in the AMS Speaker Series)

Tue 11/10/15, 12:00pm, Hackerman B17**Better Science Through Better Bayesian Computation***Ryan Adams, Harvard University*

**Abstract:** As we grapple with the hype of “big data” in computer science, it is important to remember that the data are not the central objects: we collect data to answer questions and inform decisions in science, engineering, policy, and beyond. In this talk, I will discuss my work in developing tools for large-scale data analysis, and the scientific collaborations in neuroscience, chemistry, and astronomy that motivate me and keep this work grounded. I will focus on two lines of research that I believe capture an important dichotomy in my work and in modern probabilistic modeling more generally: identifying the “best” hypothesis versus incorporating hypothesis uncertainty. In the first case, I will discuss my recent work in Bayesian optimization, which has become the state-of-the-art technique for automatically tuning machine learning algorithms, finding use across academia and industry. In the second case, I will discuss scalable Markov chain Monte Carlo and the new technique of Firefly Monte Carlo, which is the first provably correct MCMC algorithm that can take advantage of subsets of data.

**Bio:** Ryan Adams is Head of Research at Twitter Cortex and an Assistant Professor of Computer Science at Harvard. He received his Ph.D. in Physics at Cambridge as a Gates Scholar. He was a CIFAR Junior Research Fellow at the University of Toronto before joining the faculty at Harvard. He has won paper awards at ICML, AISTATS, and UAI, and his Ph.D. thesis received Honorable Mention for the Savage Award for Theory and Methods from the International Society for Bayesian Analysis. He also received the DARPA Young Faculty Award and the Sloan Fellowship. Ryan was the CEO of Whetlab, a machine learning startup that was recently acquired by Twitter, and co-hosts the Talking Machines podcast.

(an ML talk in the CLSP Speaker Series)

Tue 11/03/15, 12:00pm, Hackerman B17**Towards Fast Autonomous Learners***Emma Brunskill, Carnegie Mellon University*

**Abstract:** A fundamental goal of artificial intelligence is to create agents that learn to make good decisions as they interact with a stochastic environment. Some of the most exciting and valuable potential applications involve systems that interact directly with humans, such as intelligent tutoring systems or medical interfaces. In these cases, sample efficiency is highly important, as each decision, good or bad, is impacting a real person. I will describe our research on tackling this challenge, including transfer learning across sequential decision making tasks, as well as its relevance to improving educational tools.

**Bio:** Emma Brunskill is an assistant professor in the computer science department at Carnegie Mellon University. She is also affiliated with the machine learning department at CMU. She works on reinforcement learning, focusing on applications that involve artificial agents interacting with people, such as intelligent tutoring systems. She is a Rhodes Scholar, Microsoft Faculty Fellow and NSF CAREER award recipient, and her work has received best paper nominations in Education Data Mining (2012, 2013) and CHI (2014).

(an ML talk in the CLSP Speaker Series)

Wed 10/28/15, 10:00am, Clark 314**Global Optimality in Representation Learning***Ben Haeffele, JHU*

**Abstract:** Representation learning methods such as matrix factorization, tensor factorization, and neural networks have achieved considerable empirical success in many fields. However, common to a vast majority of these approaches are the significant disadvantages that 1) the associated optimization problems are typically non-convex due to a multilinear form or other convexity destroying transformation and 2) one is forced to specify the size of the learned representation a priori. This talk will present a very general framework which allows for the analysis of a wide range of non-convex representation learning problems. The framework allows the derivation of sufficient conditions to guarantee that a local minimum of the non-convex optimization problem is a global minimum and that from any initialization it is possible to find a global minimizer using a purely local descent algorithm. Further, the framework also allows for a wide range of regularization to be incorporated into the model to capture known features of data and to adaptively fit the size of the learned representation to the data instead of defining it a priori. Multiple implications of this work are discussed as they relate to modern practices in deep learning.

(an ML talk in the CIS Speaker Series)

Tue 10/27/15, 12:00pm, Hackerman B17**Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks Using Tensor Methods***Anima Anandkumar, UC Irvine*

**Abstract:** Training neural networks is a highly non-convex problem and in general is NP-hard. Local search methods such as gradient descent get stuck in spurious local optima, especially in high dimensions. We present a novel method based on tensor decomposition that trains a two layer neural network with guaranteed risk bounds for a large class of target functions with polynomial sample and computational complexity. We also demonstrate how unsupervised learning can help in supervised tasks. In our context, we estimate probabilistic score functions via unsupervised learning which are then employed for training neural networks using tensor methods.

**Bio:** Anima Anandkumar has been a faculty member in the EECS Dept. at UC Irvine since August 2010. Her research interests are in the area of large-scale machine learning and high-dimensional statistics. She received her B.Tech in Electrical Engineering from IIT Madras in 2004 and her PhD from Cornell University in 2009. She was a visiting faculty member at Microsoft Research New England in 2012 and a postdoctoral researcher at the Stochastic Systems Group at MIT from 2009 to 2010. She is the recipient of the Alfred P. Sloan Fellowship, Microsoft Faculty Fellowship, AFOSR & ARO Young Investigator Awards, NSF CAREER Award, IBM Fran Allen PhD fellowship, Best thesis award from ACM SIGMETRICS society, and paper awards from the ACM SIGMETRICS and IEEE Signal Processing societies.

(an ML talk in the CLSP Speaker Series)

Tue 10/20/15, 01:30pm, Clark 314**Signal Recovery from Scattering Convolutional Networks***Joan Bruna, Dept of Statistics at UC Berkeley*

**Abstract:** Scattering networks are a class of Convolutional Networks where the convolutional filter banks are given by complex, multiresolution wavelet families. As a result of this extra structure, they are provably stable and locally invariant signal representations, and yield state-of-the-art classification results on several pattern and texture recognition problems where training examples may be limited. The reasons for such success lie in the ability to preserve discriminative information while generating stability with respect to high-dimensional deformations. In this talk, we will explore the discriminative aspect of the representation, giving conditions under which signals can be recovered from their scattering coefficients, as well as introducing a family of Gibbs scattering processes, from which one can sample image and auditory textures. Although the scattering recovery is non-convex and corresponds to a generalized phase recovery problem, gradient descent algorithms show good empirical performance and enjoy weak convergence properties. We will discuss connections with non-linear compressed sensing and applications to texture synthesis and inverse problems such as super-resolution.

**Bio:** Joan graduated cum laude from Universitat Politècnica de Catalunya in both Mathematics and Electrical Engineering, and he obtained an MSc in Applied Mathematics from ENS Cachan. He then became a Sr. Research Engineer in an Image Processing startup, developing real-time video processing algorithms. In 2013 he obtained his PhD in Applied Mathematics at École Polytechnique, under the supervision of Prof. Stéphane Mallat. After a postdoctoral stay at the Courant Institute, NYU, New York, in Yann LeCun’s lab, he became a Postdoctoral fellow at Facebook AI Research. Since Jan 2015 he has been an Assistant Professor at UC Berkeley, Statistics Department. His research interests include invariant signal representations, stochastic processes, harmonic analysis, deep learning, and its applications to computer vision. http://www.stat.berkeley.edu/~bruna/

(an ML talk in the CIS Speaker Series)

Tue 10/13/15, 12:00pm, Hackerman B17**Deep Learning in NLP and Beyond***Tomas Mikolov, Facebook*

**Abstract:** I will provide a short overview of some of the success stories in NLP that involve advanced machine learning techniques, such as recurrent neural networks. Then, I will try to explain what is the main motivation for researchers to be interested in the techniques and approaches currently known as Deep learning. Following this motivation, I will present some novel ideas and possible directions for future research that would bring us closer to development of machines that can understand natural language and communicate with us.

**Bio:** Tomas Mikolov is a research scientist at Facebook AI Research lab located in NY. His most influential work includes development of recurrent neural network language models and discovery of semantic regularities in distributed word representations. These projects have been published as the open-source tools RNNLM and word2vec, which have since been widely used both in academia and industry. His main research interest is to develop intelligent machines.

(an ML talk in the CLSP Speaker Series)

Thu 10/08/15, 01:30pm, Whitehead 304**Change Point Inference for Time-Varying Erdős-Rényi Graphs***George Michailidis, University of Florida*

**Abstract:** We investigate a model of an Erdős-Rényi graph, where the edges can be in a present/absent state. The state of each edge evolves as a Markov chain independently of the other edges, with parameters that exhibit change-point behavior in time. We derive the maximum likelihood estimator for the change-point and characterize its distribution. Depending on a measure of the signal-to-noise ratio present in the data, different limiting regimes emerge. Nevertheless, a unifying adaptive scheme can be used in practice that covers all cases. We illustrate the model and its flexibility on US Congress voting patterns using roll call data.

(an ML talk in the AMS Speaker Series)

Tue 10/06/15, 12:00pm, Hackerman B17**Neuromorphic Language Understanding***Guido Zarrella, MITRE Corporation*

**Abstract:** Recurrent neural networks are effective tools for processing natural language, and can be trained to perform sequence processing tasks such as translation, classification, language modeling, and paraphrase detection. But despite major gains in the training and application of artificial neural networks, it remains difficult to construct biologically-inspired models of cognition and language understanding. This talk will discuss recent work to bridge the gap between these fields. We will show how deep neural networks are being used to solve language understanding tasks, and demonstrate that many of these networks can be adapted to run on ultra-low power neuromorphic hardware which simulates the spiking of individual neurons. The resulting proof-of-concept, developed in collaboration at the 2015 Telluride Neuromorphic Engineering Workshop, is an interactive embedded system that uses recurrent neural networks to process language while consuming an estimated 0.00005 watts.

**Bio:** Guido Zarrella is a Principal Artificial Intelligence Engineer at the MITRE Corporation in Denver, Colorado. He leads an R&D team pursuing advances in deep learning for language understanding. He first began building automatic language learning systems at Carnegie Mellon University for his undergraduate research advisor Herbert A. Simon. His work today still focuses on unsupervised learning of meaning and intent in informal language.

(an ML talk in the CLSP Speaker Series)

Tue 09/29/15, 12:00pm, Hackerman B17**Learning and Mining in Large-Scale Time Series Data***Yan Liu, University of Southern California*

**Abstract:** Many emerging applications of machine learning involve time series and spatio-temporal data. In this talk, I will discuss a collection of machine learning approaches to effectively analyze and model large-scale time series and spatio-temporal data. Experiment results will be shown to demonstrate the effectiveness of our models in healthcare and climate applications.

**Bio:** Yan Liu has been an assistant professor in the Computer Science Department at the University of Southern California since 2010. Before that, she was a Research Staff Member at IBM Research. She received her M.Sc. and Ph.D. degrees from Carnegie Mellon University in 2004 and 2007. Her research interests include developing scalable machine learning and data mining algorithms with applications to social media analysis, computational biology, climate modeling and healthcare analytics. She has received several awards, including the NSF CAREER Award, Okawa Foundation Research Award, ACM Dissertation Award Honorable Mention, Best Paper Award in SIAM Data Mining Conference, Yahoo! Faculty Award, and IBM Faculty Award, and has won several data mining competitions such as KDD Cup and the INFORMS data mining competition.

(an ML talk in the CLSP Speaker Series)

Tue 09/29/15, 10:00am, Barber Conference Room at Charles Commons**Deep Learning for Regulatory Genomics***Anshul Kundaje, Stanford University*

**Abstract:** Deep neural network approaches such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs) have resulted in dramatic performance improvements for several learning tasks in Natural Language Processing, Speech Processing and Computer Vision. We investigate the power of deep learning methods in the context of regulatory genomics and develop novel learning frameworks for integrating key functional genomic data types. Our primary objective is to decipher the relationships between regulatory sequence, transcription factor binding, nucleosome positioning, chromatin accessibility and histone modifications. First, using extensive simulations of regulatory DNA sequence, we evaluate the ability of deep CNNs and CNN-RNNs trained on raw sequence to learn different properties of transcription factor binding sites, including probabilistic affinity to sequence motifs, positional and density distributions of motifs, and combinatorial sequence grammars involving co-factor sequence preferences with spacing and order constraints. We leverage these architectures in a multi-task setting to learn predictive models of in-vivo TF binding from ChIP-seq data for a large compendium of TFs across multiple cell types and tissues. Our results demonstrate significantly superior generalization performance of deep learning methods, especially CNN-RNNs, compared to state-of-the-art approaches for modeling TF binding within and across cell types. We further develop novel methods for model exploration, visualization and feature selection to dissect the heterogeneity of the sequence code underlying direct and indirect TF binding. Next we investigate the relationship between chromatin accessibility, nucleosome positioning and chromatin state (histone marks). We train multi-task, multi-modal CNNs on a novel two-dimensional representation of ATAC-seq data that leverages subtle patterns in insert-size distributions to simultaneously predict multiple histone modifications, combinatorial chromatin state and CTCF binding sites with high accuracy. Models trained on a combination of DNase-seq and MNase-seq data achieve even higher accuracy, supporting a fundamental predictive mapping between local chromatin architecture and chromatin state. We use novel feature extraction and visualization methods to peer into the deep neural networks and identify predictive patterns reminiscent of nucleosomal asymmetry and TF footprints. Finally, we will discuss general strategies and easy-to-use software packages for rapid prototyping and learning of optimal deep architectures from functional genomic data.

**Bio:** Anshul Kundaje is an Assistant Professor of Genetics and Computer Science at Stanford University and a 2014 Alfred Sloan Fellow. His primary research interest is computational regulatory genomics. His lab develops statistical and machine learning methods for large-scale integrative analysis of diverse functional genomic data to decipher heterogeneity of regulatory elements, uncover their long-range interactions in the context of 3D genome organization, learn transcriptional regulatory network models across cell-types and understand the system-level regulatory impact of non-coding genetic variation. Anshul has previously led the computational analysis efforts of The Encyclopedia of DNA Elements (ENCODE) Project and the Roadmap Epigenomics Project.

Thu 09/24/15, 01:30pm, Whitehead 304**Challenges in Graph-Based Machine Learning and Robustifying Data Graphs with Scalable Local Spectral Methods***Michael Mahoney, UC Berkeley*

**Abstract:** Graphs are very popular ways to model data in many data analysis and machine learning applications, but they can be quite challenging to work with, especially when they are very sparse, as is typically the case. We will discuss challenges we have encountered in working with large sparse graphs in machine learning and data analysis applications and in particular in the construction of these graphs, e.g., with various sorts of popular nearest neighbor rules applied to feature vectors. In our experience, many properties of the constructed graphs are very sensitive to seemingly-minor and often-ignored aspects of the graph construction process. This should suggest caution in using popular algorithmic and statistical tools, e.g., popular nonlinear dimensionality reduction methods, in trying to extract insight from those constructed graphs. We will also describe recent results on using local spectral methods to robustify this graph construction process. Local spectral methods use locally-biased random walks. They have had several remarkable successes in worst-case algorithm design as well as in analyzing the empirical properties of large social and information networks, and they are an example of a worst-case approximation algorithm that implicitly but exactly implements a form of statistical regularization. Informally, the reason for the successes of these methods in robustifying graph construction is that these local random walks provide a regularized or stable version of an eigenvector, and initial results on using these ideas to robustify the graph construction process are promising.

(an ML talk in the AMS Speaker Series)

Fri 09/11/15, 12:00pm, Hackerman B17**Machine Reading for Cancer Panomics***Hoifung Poon, Microsoft Research*

**Abstract:** Advances in sequencing technology have made available a plethora of panomics data for cancer research, yet the search for disease genes and drug targets remains a formidable challenge. Biological knowledge such as pathways can play an important role in this quest by constraining the search space and boosting the signal-to-noise ratio. The majority of knowledge resides in text such as journal articles, which has been undergoing its own explosive growth, making it mandatory to develop machine reading methods for automating knowledge extraction. In this talk, I will formulate the machine reading task for pathway extraction, review the state of the art and open challenges, and present our Literome project and latest attack on the problem based on grounded semantic parsing.

**Bio:** Hoifung Poon is a researcher at Microsoft Research. His research interests lie in advancing machine learning and natural language processing (NLP) to help automate discovery in genomics and precision medicine. His most recent work focuses on scaling semantic parsing to PubMed for extracting biological pathways, and on developing probabilistic methods to incorporate pathways with high-throughput omics data in cancer systems biology. He has received Best Paper Awards in premier NLP and machine learning venues such as the Conference of the North American Chapter of the Association for Computational Linguistics, the Conference of Empirical Methods in Natural Language Processing, and the Conference of Uncertainty in AI.

(an ML talk in the CLSP Speaker Series)

Tue 09/08/15, 11:00am, Sherwood Room, Levering Hall**Individualized Prognosis of Disease Trajectories: Application to Scleroderma***Suchi Saria, Johns Hopkins University*

**Abstract:** For many complex diseases, there is a wide variety of ways in which an individual can manifest the disease. The challenge of personalized medicine is to develop tools that can accurately predict the trajectory of an individual’s disease. Access to such tools can help clinicians tailor therapy to the individual. We propose a hierarchical latent variable model that shares statistical strength across observations at different resolutions–the population, subpopulation and the individual level. We describe an algorithm for learning population and subpopulation parameters offline, and an online procedure for dynamically learning individual-specific parameters. We validate our model on the task of predicting the course of interstitial lung disease, one of the leading causes of death among patients with the autoimmune disease scleroderma. We demonstrate significant improvements in predictive accuracy over the state of the art. This is joint work with Peter Schulam (PhD student), Colin Ligon (clinical fellow), and Fredrick Wigley and Laura Hummers at the Hopkins Scleroderma Center.

Fri 08/28/15, 12:00pm, Hackerman B17**I-Vector Representation Based on GMM and DNN for Audio Classification***Najim Dehak, MIT, CSAIL*

**Abstract:** The I-vector approach has become the state-of-the-art approach in several audio classification tasks such as speaker and language recognition. This approach consists of modeling and capturing all the different variability in the Gaussian Mixture Model (GMM) mean components between several audio recordings. More recently, several subspace approaches have been extended to model the variability between the GMM weights rather than the GMM means. These techniques, such as Non-negative Factor Analysis (NFA) and the Subspace Multinomial Model (SMM), need to deal with the fact that the GMM weights are always positive and should sum to one. In this talk, we will show how the NFA and SMM approaches, or other similar subspace approaches, can also be used to model the hidden layer neuron activations of a deep neural network model for sequential data recognition tasks such as language and dialect recognition.

**Bio:** Najim Dehak received his Engineering degree in Artificial Intelligence in 2003 from Universite des Sciences et de la Technologie d’Oran, Algeria, and his MS degree in Pattern Recognition and Artificial Intelligence Applications in 2004 from the Universite de Pierre et Marie Curie, Paris, France. He obtained his Ph.D. degree from Ecole de Technologie Superieure (ETS), Montreal in 2009. During his Ph.D. studies he was also with Centre de recherche informatique de Montreal (CRIM), Canada. In the summer of 2008, he participated in the Johns Hopkins University, Center for Language and Speech Processing, Summer Workshop. During that time, he proposed a new system for speaker verification that uses factor analysis to extract speaker-specific features, thus paving the way for the development of the i-vector framework. Dr. Dehak is currently a research scientist in the Spoken Language Systems (SLS) Group at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). His research interests are in machine learning approaches applied to speech processing and speaker modeling. The current focus of his research involves extending the concept of an i-vector representation into other audio classification problems, such as speaker diarization, language- and emotion-recognition.

(an ML talk in the CLSP Speaker Series)

Tue 04/21/15, 03:00pm, Arellano Theater**Sensitivity Analysis of Neuronal Behaviors***Rodolphe Sepulchre, Cambridge University, England*

**Abstract:** Reliable neuron activity is ensured by a tight regulation of ion channel expression. Understanding the causal mechanisms that relate this regulation to physiological and pathological neuronal activity is a necessary step for developing efficient therapies for neurological diseases associated with abnormal nervous system activity. We will discuss a novel methodological framework to quantify the sensitivity of neuronal activity to changes in ion channel densities, either from a detailed conductance-based model or from voltage-clamped experimental data. We will illustrate the generality of the framework and its potential to improve our understanding of the regulation of brain functions and to help in the design of new pharmacological treatments.

Fri 04/17/15, 12:00pm, Room W3030, School of Public Health**Missing Data as a (Restricted) Causal Inference Problem***Ilya Shpitser, University of Southampton*

**Abstract:** Causal inference is often phrased as a missing data problem: for every unit, only the response to observed treatment assignment is known, the response to other treatment assignments is not. At the same time, conditions under which valid analysis in the presence of missingness is possible using the data actually observed are often phrased in terms of “missingness mechanisms,” which are really special kinds of causal mechanisms, although this connection is often kept implicit. In this talk, I discuss some recent work where missing data problems are explicitly represented as a (possibly restricted) graphical causal model. This representation allows us to leverage existing theory on identification of causal effects to give a general criterion for cases where a joint distribution containing missing variables can be recovered from data actually observed. This criterion is significantly more general than the commonly used “missing at random” (MAR) criterion. In fact, the relationship of this criterion to MAR is not unlike the relationship between Tian’s algorithm for identification of causal effects, and conditional ignorability. This is joint work with Karthika Mohan and Judea Pearl.

(an ML talk in the Biostatistics Speaker Series)

Thu 04/16/15, 10:45am, Hackerman Hall B17**Generalized Independence Constraints: Models and Inference***Ilya Shpitser, University of Southampton*

**Abstract:** Many datasets are plagued by unobserved confounders: hidden but relevant variables. The presence of hidden variables obscures many conditional independence constraints in the underlying model, and greatly complicates data analysis. In this talk I consider a type of equality constraint which generalizes conditional independence, and which is a “natural” equality constraint for data generated from the marginal distribution of a DAG graphical model. I discuss applications of such constraints to recovering causal structure from data, and statistical inference in hidden variable models. To this end, I introduce a new kind of graphical model, called the nested Markov model, which is defined via these constraints just as Bayesian networks and Markov random fields are defined via conditional independence constraints. I describe parameterizations for nested Markov models with discrete state spaces, together with parameter and structure learning algorithms. I show cases where a single generalized equality constraint is sufficient to completely recover a nested Markov model, and thus the corresponding family of hidden variable DAGs. Part of this material is based on joint work with Thomas S. Richardson, James M. Robins, and Robin J. Evans. Part of this material is based on joint work with Judea Pearl.

**Bio:** Ilya Shpitser is a Lecturer in Statistics at the University of Southampton. His interests include inference in complex multivariate settings, causal inference from observational data, particularly in longitudinal settings, and foundational issues in causality. Previously, Ilya was a research fellow at the causal inference group headed by James M. Robins at the Harvard School of Public Health. He received his PhD under the supervision of Judea Pearl at UCLA.

(an ML talk in the CS Speaker Series)

Wed 04/15/15, 10:30am, COE Stieff Building, North Conference Room**Incorporating Compositional and Relational Semantics into Word Representation Learning***Kevin Duh, Nara Institute of Science and Technology*

**Abstract:** I will discuss algorithms for learning vector representations of words. Recent work has shown that such vector representations, also known as word embeddings, can successfully capture semantic and syntactic regularities in language (Mikolov2013) and improve the performance of various Natural Language Processing systems, including information extraction (Turian2010, Wang2013), semantic role labeling (Collobert2011), and sentiment analysis (Socher2013). Although many algorithms for learning word representations have been proposed, most are based on the same premise of “distributional semantics,” where words from similar contexts are mapped to nearby vectors. This embodies J. R. Firth’s dictum: “You shall know a word by the company it keeps.” However, distributional semantics is by no means the only way to model word meaning. In this talk, I shall argue that “compositional semantics” and “relational semantics” also need to be incorporated into representation learning. First, I will present a neural language model that dynamically re-computes its word representations on-the-fly; this helps account for the compositional semantics phenomena of new meaning formation and meaning shift. Second, I will describe a representation learning algorithm that incorporates relational semantics from WordNet and other knowledge bases. This algorithm, based on the Alternating Direction Method of Multipliers (ADMM), provides a simple yet flexible approach to integrating multiple types of semantic premises into word representation learning.

**Bio:** Kevin Duh is an assistant professor at the Nara Institute of Science and Technology (NAIST), Graduate School of Information Science. He received his B.S. in 2003 from Rice University, and PhD in 2009 from the University of Washington, both in Electrical Engineering. Prior to joining NAIST, he worked at the NTT Communication Science Laboratories (2009-2012). His research interests lie at the intersection of Natural Language Processing and Machine Learning, in particular on areas relating to machine translation, semi-supervised learning, and deep learning. Website: http://cl.naist.jp/~kevinduh

(an ML talk in the HLTCOE Speaker Series)

Wed 04/01/15, 12:00pm, Hackerman Hall B17**Active Information Acquisition with Mobile Robots and Configurable Sensing Systems***George J. Pappas, University of Pennsylvania*

**Abstract:** As the world is getting instrumented with numerous sensors, cameras, and robots, there is potential to transform industries as diverse as environmental monitoring, search and rescue, security and surveillance, localization and mapping, and structure inspection. Successful estimation techniques for gathering information in these scenarios have been designed and implemented. However, one of the great technical challenges today is to intelligently control the sensors, cameras, and robots in order to extract information actively and autonomously. In this talk, I will present a unified approach for active information acquisition, aimed at improving the accuracy and efficiency of tracking evolving phenomena of interest. I formulate a decision problem for maximizing relevant information measures and focus on the design of scalable control strategies for multiple sensing systems. First, I will present a greedy approach for information acquisition via applications in source seeking and mobile robot localization. Next, information acquisition with a longer planning horizon will be considered in the context of linear Gaussian models. I will develop an approximation algorithm with suboptimality guarantees to reduce the complexity in the planning horizon and the number of sensors and will present an application to active multi-robot localization and mapping. Finally, non-greedy information acquisition with general sensing models will be used for active object recognition. The techniques presented in this talk offer an effective and scalable approach for controlled information acquisition with multiple sensing systems.

**Bio:** George J. Pappas is the Joseph Moore Professor and Chair of the Department of Electrical and Systems Engineering at the University of Pennsylvania. He also holds a secondary appointment in the Departments of Computer and Information Sciences, and Mechanical Engineering and Applied Mechanics. He is a member of the GRASP Lab and the PRECISE Center. He has previously served as the Deputy Dean for Research in the School of Engineering and Applied Science. His research focuses on control theory and in particular, hybrid systems, embedded systems, hierarchical and distributed control systems, with applications to unmanned aerial vehicles, distributed robotics, green buildings, and biomolecular networks. He is a Fellow of IEEE, and has received various awards such as the Antonio Ruberti Young Researcher Prize, the George S. Axelby Award, the O. Hugo Schuck Best Paper Award, and the National Science Foundation PECASE.

(an ML talk in the LCSR Speaker Series)

Wed 04/01/15, 12:00pm, Malone 228**Bandits with Resource Constraints***Alex Slivkins, Microsoft Research*

**Abstract:** Multi-armed bandits is the predominant theoretical model for the exploration-exploitation tradeoff in machine learning, with countless applications ranging from medical trials, to communication networks, to Web search and advertising, to dynamic pricing. In many of these application domains the learner may be constrained by one or more supply/budget limits, in addition to the customary limitation on the time horizon. We introduce a general model that encompasses such problems, combining aspects of stochastic integer programming with online learning. A distinctive feature (and challenge) in our model, compared to the conventional bandit problems, is that the optimal policy for a given problem instance may significantly outperform the policy that always chooses the best fixed action. Our main result is an algorithm with near-optimal regret relative to the optimal policy. Also, we extend this result to contextual bandits, and detail an application to dynamic pricing.

Mon 03/30/15, 12:15pm, Room W3008, School of Public Health**Effective Connectivity and Dynamic Directional Model for ECoG Data***Dr. Tingting Zhang, Department of Statistics, University of Virginia*

**Abstract:** We introduce a dynamic directional model (DDM) for studying brain effective connectivity based on intracranial electrocorticographic (ECoG) time series. The DDM consists of two parts: a set of differential equations describing neuronal activity of brain components (state equations), and observation equations linking the underlying neuronal states to observed data. When applied to functional MRI or EEG data, DDMs usually have complex formulations and thus can accommodate only a few regions, due to limitations in spatial resolution and/or temporal resolution of these imaging modalities. In contrast, we formulate our model in the context of ECoG data. The combined high temporal and spatial resolution of ECoG data results in a much simpler DDM, allowing investigation of complex connections between many regions. To identify functionally segregated sub-networks, a form of biologically economical brain networks, we propose the Potts model for the DDM parameters. The neuronal states of brain components are represented by cubic spline bases and the parameters are estimated by minimizing a log-likelihood criterion that combines the state and observation equations. The Potts model is converted to the Potts penalty in the penalized regression approach to achieve sparsity in parameter estimation, for which a fast iterative algorithm is developed. The methods are applied to an auditory ECoG dataset.

(an ML talk in the Biostatistics Speaker Series)

Fri 03/27/15, 12:00pm, Hackerman B17**Recursive Estimation of Multivariate Markov Processes***Yariv Ephraim, George Mason University*

**Abstract:** A multivariate Markov process is a vector of random processes which are jointly, but not necessarily individually, Markov. Multivariate Markov processes form a rich family of mathematically tractable models, which have been found useful in numerous applications. Examples include the hidden Markov process in discrete-time, and the Markov modulated Poisson process in continuous-time. Multivariate Markov processes lend themselves to the mathematical tractability of Markov processes while allowing non-Markovian features for the individual process components. For example, while the distribution of the sojourn time of a multivariate Markov chain in each of its states is geometric or exponential, the distribution of the sojourn time of an individual process component in each of its states is phase-type. In this talk we present forward recursions for estimating some statistics of bivariate Markov chains, and of bivariate Markov chains observed through memoryless channels. We demonstrate the use of these forward recursions in an online algorithm for estimating the parameter of the process. Recursions for discrete-time as well as continuous-time multivariate Markov processes will be discussed. This presentation is based on joint work with Brian L. Mark. This work was supported in part by the National Science Foundation under grant CCF-0916568.

**Bio:** Yariv Ephraim received the D.Sc. degree in electrical engineering from the Technion-Israel Institute of Technology, Haifa, Israel, in 1984. From 1984 to 1985 he was a Rothschild Postdoctoral Fellow at the Information Systems Laboratory, Stanford University, Stanford, CA. From 1985 to 1993 he was a Member of Technical Staff at the Information Principles Research Laboratory, AT&T Bell Laboratories, Murray Hill, NJ. In 1991 he joined George Mason University, Fairfax, VA, where he currently is Professor of Electrical and Computer Engineering.

(an ML talk in the CLSP Speaker Series)

Tue 03/24/15, 10:45am, Hackerman Hall B17**Generalizability in Causal Inference***Elias Bareinboim, University of California, Los Angeles*

**Abstract:** Empirical scientists seek not just surface descriptions of the observed data, but deeper explanations of why things happened the way they did, and what the world would be like had things happened differently. With the unprecedented accumulation of data (or, “big data”), researchers are becoming increasingly aware of the fact that traditional statistical techniques, including those based on artificial intelligence and machine learning, must be enriched with two additional ingredients in order to construct such explanations: 1. the ability to integrate data from multiple, heterogeneous sources, and 2. the ability to distinguish causal from associational relationships. In this talk, I will present a theory of causal generalization that provides a principled way for fusing pieces of empirical evidence coming from multiple, heterogeneous sources. I will first introduce a formal language capable of encoding the assumptions necessary to express each problem instance. I will then present conditions and algorithms for deciding whether a given problem instance admits a consistent estimate for the target effects and, if feasible, fuse information from various sources to synthesize such an estimate. These results subsume the analyses conducted in various fields in the empirical sciences, including “external validity,” “meta-analysis,” “heterogeneity,” “quasi-experiments,” “transportability,” and “sampling selection bias.” I will conclude by presenting new challenges and opportunities opened by this research.

**Bio:** Elias Bareinboim is a postdoctoral scholar (and was a Ph.D. student) in the Computer Science Department at the University of California, Los Angeles, working with Judea Pearl. His interests are in causal and counterfactual inferences and their applications. He is also broadly interested in artificial intelligence, machine learning, robotics, and philosophy of science. His doctoral thesis provides the first general framework for solving the generalizability problems in causal inference — which has applications across all the empirical sciences. Bareinboim’s recognitions include the Dan David Prize Scholarship, the Yahoo! Key Scientific Challenges Award, the Outstanding Paper Award at the 2014 Annual Conference of the American Association for Artificial Intelligence (AAAI), and the Edward K. Rice Outstanding Graduate Student.

(an ML talk in the CS Speaker Series)

Mon 03/23/15, 12:15pm, Room W3008, School of Public Health**Bayes and Big Data: The Consensus Monte Carlo Algorithm***Steven L. Scott, Alexander Blocker, Fernando Bonassi*

**Abstract:** A useful definition of “big data” is data that is too big to comfortably process on a single machine, either because of processor, memory, or disk bottlenecks. Graphics processing units can alleviate the processor bottleneck, but memory or disk bottlenecks can only be eliminated by splitting data across multiple machines. Communication between large numbers of machines is expensive (regardless of the amount of data being communicated), so there is a need for algorithms that perform distributed approximate Bayesian analyses with minimal communication. Consensus Monte Carlo operates by running a separate Monte Carlo algorithm on each machine, and then averaging individual Monte Carlo draws across machines. Depending on the model, the resulting draws can be nearly indistinguishable from the draws that would have been obtained by running a single machine algorithm for a very long time. Examples of consensus Monte Carlo are shown for simple models where single-machine solutions are available, for large single-layer hierarchical models, and for Bayesian additive regression trees (BART).
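The averaging step described in this abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the speakers' implementation: the model is a Gaussian mean with known variance, each "machine" samples directly from its closed-form shard posterior instead of running a Monte Carlo algorithm, the prior is split evenly across shards, and draws are combined with precision weights. The function names `shard_posterior` and `consensus_draws` are hypothetical.

```python
import random
import statistics

def shard_posterior(ys, n_shards, sigma2=1.0, tau2=10.0):
    """Posterior (mean, precision) of mu for one shard.

    Assumes y ~ N(mu, sigma2) and gives each shard the prior N(0, n_shards * tau2),
    so that the shard priors multiply back to the full prior N(0, tau2).
    """
    precision = len(ys) / sigma2 + 1.0 / (n_shards * tau2)
    mean = (sum(ys) / sigma2) / precision
    return mean, precision

def consensus_draws(shards, n_draws, rng, sigma2=1.0, tau2=10.0):
    """Combine per-shard draws into consensus draws by a precision-weighted average."""
    posts = [shard_posterior(ys, len(shards), sigma2, tau2) for ys in shards]
    total_precision = sum(p for _, p in posts)
    draws = []
    for _ in range(n_draws):
        # Each "machine" contributes one draw from its own shard posterior.
        local = [rng.gauss(m, p ** -0.5) for m, p in posts]
        weighted = sum(p * d for (_, p), d in zip(posts, local))
        draws.append(weighted / total_precision)
    return draws

rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(1000)]
shards = [data[i::4] for i in range(4)]           # split the data across 4 "machines"
draws = consensus_draws(shards, 4000, rng)
full_mean, full_precision = shard_posterior(data, 1)  # single-machine posterior
```

In this Gaussian toy case the consensus draws have the same mean and variance as the single-machine posterior, which matches the abstract's remark that the two can be nearly indistinguishable for some models; for non-Gaussian posteriors the combination is only approximate.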

(an ML talk in the Biostatistics Speaker Series)

Fri 03/20/15, 12:00pm, Hackerman Hall B17**Ensembles for the Discovery of Compact Structures in Data***Madalina Fiterau, Carnegie Mellon University*

**Abstract:** In many practical scenarios, complex high-dimensional data contains low-dimensional structures that could be informative of the analytic problems at hand. I will present a method that detects such structures if they exist, and uses them to construct compact interpretable models for different machine learning tasks that can benefit practical applications. To start with, I will formalize Informative Projection Recovery, the problem of extracting a small set of low-dimensional projections of data that jointly support an accurate model for a given learning task. Our solution to this problem is a regression-based algorithm that identifies informative projections by optimizing over a matrix of point-wise loss estimators. It generalizes to multiple types of machine learning problems, offering solutions to classification, clustering, regression, and active learning tasks. Experiments show that our method can discover and leverage low-dimensional structures in data, yielding accurate and compact models. Our method is particularly useful in applications in which expert assessment of the results is of the essence, such as classification tasks in the healthcare domain.

**Bio:** Madalina Fiterau is a PhD student in Machine Learning at Carnegie Mellon University and a member of the Auton Lab. She is advised by Prof. Artur Dubrawski. Her research interests include query-specific models for decision support systems, learning with structured sparsity, dimensionality reduction in an active learning setting and anomalous pattern detection. She received her BE in Computer Engineering from the Politehnica University of Timisoara.

(an ML talk in the CLSP Speaker Series)

Wed 03/18/15, 12:30pm, Clark 314**Toward Large-Scale Human Behavior Analysis***Minh Hoai Nguyen, Stony Brook University*

**Abstract:** Enabling computers to understand human behavior has the potential to revolutionize many areas that benefit society such as clinical diagnosis, human-computer interaction, and social robotics. Critical to the understanding of human behavior is the ability to detect humans and recognize their actions. In the first part of my talk, I will present an algorithm to accurately and efficiently detect configurations of one or more people in edited TV material. Such configurations often appear in standard arrangements due to cinematic style, and I take advantage of this to provide scene context. In the second part of this talk, I will describe an algorithm to harness the power of big data for large-scale human action recognition. In particular, I will present Instance Ranking SVM (IRSVM), a novel Multiple Instance Learning (MIL) formulation for learning from data with little annotation. IRSVM outperforms existing MIL algorithms by embracing the correlation of data instances and anticipating the deficiency of the instance classifier.

**Bio:** Minh Hoai Nguyen is an Assistant Professor of Computer Science at Stony Brook University. Prior to Stony Brook, he was a Nicholas Kurti junior research fellow at Brasenose College, Oxford University. He received a Bachelor of Engineering from The University of New South Wales in 2006 and a Ph.D. in Robotics from Carnegie Mellon University in 2012. His research interests are in Computer Vision and Machine Learning. In 2012, Nguyen and his coauthor received the Best Student Paper Award at the IEEE Conference On Computer Vision and Pattern Recognition (CVPR).

(an ML talk in the CIS Speaker Series)

Tue 03/10/15, 01:30pm, Clark 314**Hamiltonian Monte Carlo in Computational Anatomy***Christof Seiler, Stanford University – Department of Statistics*

**Abstract:** The aim of Computational Anatomy (CA) is to describe human anatomy not by what it looks like but by how it deforms. Statistical analysis of medical image deformations can be used to classify and predict diseases. We approach CA from a Bayesian perspective by sampling deformations from a posterior distribution. To sample from such high dimensional distributions we resort to Markov chain Monte Carlo methods. My talk has two parts. First, we investigate the theoretical properties of one sampler, Hamiltonian Monte Carlo (HMC), known to work well in practice. We reformulate HMC in the language of Riemannian geometry and calculate sectional curvatures to bound its running time. Second, we apply HMC to analyze the geometry of lower back and abdominal pain in 400 patients. This is joint work with Susan Holmes, Simon Rubinstein-Salzedo, Xavier Pennec, and Nicolas Bronsard.

**Bio:** I am a SNSF Postdoctoral Fellow mentored by Susan Holmes at Stanford University in the Department of Statistics. In 2012, I obtained my joint PhD co-advised by Xavier Pennec (Asclepios Team, Inria, France) and Mauricio Reyes (ISTB, University of Bern, Switzerland). My main research focus is inference in medical imaging analysis in the context of psychiatry and the musculoskeletal system.

(an ML talk in the CIS Speaker Series)

Fri 02/13/15, 12:15pm, Room W3008, School of Public Health**Robust Covariance Functional Inference***Fang Han, JHU, Department of Biostatistics*

**Abstract:** Covariance functional inference plays a key role in high dimensional statistics. A wide variety of statistical methods, including principal component analysis, Gaussian graphical model estimation, and multiple linear regression, are intrinsically inferring covariance functionals. In this talk, I will present a unified framework for analysis of complex (non-Gaussian, heavy-tailed, dependent, high dimensional) data. It connects covariance functional inference to robust statistics. Within this unified framework, I will introduce three new methods: elliptical component analysis, robust homogeneity test, and rank-based estimation of latent VAR model. They cover both estimation and testing problems in high dimensions and are applicable to independent or time series data. Although the generative models are complex, we show the rather surprising result that all proposed methods are minimax optimal and their performance is comparable to Gaussian-based counterparts under the Gaussian assumption. We further illustrate the strength of the proposed unified framework on real datasets.

(an ML talk in the Biostatistics Speaker Series)

Tue 02/10/15, 01:30pm, Clark 314**Operator Splitting and Optimization***Wotao Yin, UCLA, Department of Mathematics*

**Abstract:** Operator splitting schemes break a complicated optimization problem with multiple cost functions and/or constraints into very simple components, to which basic (sub)gradient, projection, and proximal operations are applied. The resulting algorithms are often short, easy to code, and have (nearly) state-of-the-art performance. This talk will review three basic operator splitting schemes and introduce a new splitting scheme. Their special cases cover a large number of existing algorithms such as von Neumann’s alternating projection, iterative soft-thresholding algorithm, ADMM, and various primal-dual algorithms. Through examples, we demonstrate that they lead to high-performance low-cost methods for large-scale optimization problems.
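One of the special cases named in this abstract, the iterative soft-thresholding algorithm, gives a compact picture of how a splitting scheme alternates a gradient step with a proximal step. The sketch below is a minimal illustration under assumptions not taken from the talk: the objective is the lasso problem 0.5 * ||Ax - b||^2 + lam * ||x||_1, matrices are plain Python lists, and the function names are hypothetical.

```python
def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, applied elementwise."""
    return [max(abs(x) - t, 0.0) * (1.0 if x > 0 else -1.0) for x in v]

def ista(A, b, lam, step, n_iter=200):
    """Forward-backward splitting for 0.5 * ||Ax - b||^2 + lam * ||x||_1.

    `step` should be at most 1 / L, where L is the largest eigenvalue of A^T A.
    """
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(n_iter):
        residual = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
        grad = [sum(A[i][j] * residual[i] for i in range(m)) for j in range(n)]
        # Forward (gradient) step on the smooth term,
        # then backward (proximal) step on the L1 term.
        x = soft_threshold([x[j] - step * grad[j] for j in range(n)], step * lam)
    return x

A = [[1.0, 0.0], [0.0, 1.0]]   # identity, so the minimizer is soft_threshold(b, lam)
b = [3.0, -0.5]
x_hat = ista(A, b, lam=1.0, step=1.0)
```

With the identity matrix the minimizer is known in closed form, so the example doubles as a sanity check: the first coordinate shrinks from 3.0 to 2.0 and the second is set to zero.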

**Bio:** Wotao Yin is a professor in the Department of Mathematics at UCLA. His research interests lie in computational optimization and its applications in image processing, machine learning, and other inverse problems. He received his B.S. in mathematics from Nanjing University in 2001, and then M.S. and Ph.D. in operations research from Columbia University in 2003 and 2006, respectively. During 2006 – 2013, he was with Rice University. He won an NSF CAREER award in 2008 and an Alfred P. Sloan Research Fellowship in 2009. His recent work has been in optimization algorithms for large-scale and distributed signal processing and machine learning problems. His talk is related to joint work with Damek Davis, which won the 2014 INFORMS Optimization Society Best Student Paper Award.

(an ML talk in the CIS Speaker Series)

Tue 02/10/15, 10:45am, Hackerman B17**Towards Scalable Analysis of Images and Videos***Eric Xing, Carnegie Mellon University*

**Abstract:** With the prevalence of mobile and wearable cameras and video-recorders, and global deployment of surveillance systems in space, air, sea, ground, and social networks, the amount of unprocessed images and videos is massive, calling for effective computational means for automatic analysis, understanding, summarization, and organization of visual data. In this talk, I will present some of our recent work on scalable machine learning approaches to image and video understanding. Specifically, I will focus on structured multitask methods for large-scale image classification, and online latent space methods for video analysis and summarization. This is joint work with Bin Zhao.

**Bio:** Dr. Eric Xing is a Professor of Machine Learning in the School of Computer Science at Carnegie Mellon University. His principal research interests lie in the development of machine learning and statistical methodology, especially for solving problems involving automated learning, reasoning, and decision-making in high-dimensional, multimodal, and dynamic possible worlds in social and biological systems. Professor Xing received his Ph.D. in Computer Science from UC Berkeley. He is an associate editor of the Annals of Applied Statistics (AOAS), the Journal of the American Statistical Association (JASA), the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), and the PLoS Journal of Computational Biology, and an Action Editor of the Machine Learning Journal (MLJ) and the Journal of Machine Learning Research (JMLR). He is a member of the DARPA Information Science and Technology (ISAT) Advisory Group, and a Program Chair of ICML 2014.

(an ML talk in the CS Speaker Series)

Wed 01/28/15, 12:15pm, Room W3008, School of Public Health**Instrumental Variables and Mendelian Randomization with Invalid Instruments***Hyunseung Kang, Statistics Department, University of Pennsylvania*

**Abstract:** Instrumental variables have been widely used for estimating the causal effect of an exposure on an outcome. Conventional estimation methods require complete knowledge about all the instruments’ validity; a valid instrument must not have a direct effect on the outcome and not be related to unmeasured confounders. Often, this is impractical as highlighted by Mendelian randomization studies where genetic markers are used as instruments and complete knowledge about instruments’ validity is equivalent to complete knowledge about the involved genes’ functions. In this talk, we propose a method for estimation of causal effects when this complete knowledge is absent. First, we show that the causal effects are identified and can be estimated as long as fewer than 50% of the instruments are invalid, without knowing which of the instruments are invalid. We also present necessary and sufficient conditions for identification when the 50% threshold is violated. Second, we introduce a fast penalized L1 estimation method, called sisVIVE, to estimate the causal effect without knowing which instruments are valid, with theoretical guarantees on its performance. Third, we propose a simple and robust method to estimate confidence intervals in this context by utilizing a unique feature of the Anderson-Rubin confidence interval (Anderson and Rubin, 1949). Finally, the proposed methods are demonstrated on simulated data and a real Mendelian randomization study. An R package sisVIVE is available online.

(an ML talk in the Biostatistics Speaker Series)

Tue 01/27/15, 01:30pm, Whitehead 304**Tuning Parameters in High-Dimensional Statistics***Johannes Lederer, Cornell University*

**Abstract:** High-dimensional statistics is the basis for analyzing large and complex data sets that are generated by cutting-edge technologies in genetics, neuroscience, astronomy, and many other fields. However, Lasso, Ridge Regression, Graphical Lasso, and other standard methods in high-dimensional statistics depend on tuning parameters that are difficult to calibrate in practice. In this talk, I present two novel approaches to overcome this difficulty. My first approach is based on a novel testing scheme that is inspired by Lepski’s idea for bandwidth selection in non-parametric statistics. This approach provides tuning parameter calibration for estimation and prediction with the Lasso and other standard methods and is to date the only way to ensure high performance, fast computations, and optimal finite sample guarantees. My second approach is based on the minimization of an objective function that avoids tuning parameters altogether. This approach provides accurate variable selection in regression settings and, additionally, opens up new possibilities for the estimation of gene regulation networks, microbial ecosystems, and many other network structures.

(an ML talk in the AMS Speaker Series)

Fri 01/23/15, 12:15pm, Room W3030, School of Public Health**Quantifying Ozone-Related Mortality Under Climate Change: Methods to Incorporate Uncertainty in Future Ozone Exposures***Stacey Alexeeff, National Center for Atmospheric Research*

**Abstract:** Climate change is expected to have many impacts on the environment, including changes in ozone concentrations at the surface level. A key public health concern is the potential increase in ozone-related summertime mortality if surface ozone concentrations rise in response to climate change. Previous health impact studies have not incorporated the variability of ozone into their prediction models. We propose a Bayesian posterior analysis and Monte Carlo estimation method for quantifying health effects of future ozone. The key features of our methodology are (i) the propagation of uncertainty in both the health effect and the ozone projections and (ii) use of the empirical distribution of the daily ozone projections to account for their variation. We use interpolation to improve the accuracy of averaging ozone exposures over the irregularly shaped county regions where mortality and demographic information is reported. In addition, we compare our thin plate spline interpolation to other linear interpolators. By carefully staging our computations and using efficient and parallel data analysis tools we are able to handle a very large volume of model output and still do relevant computations on a daily time scale. We quantify the expected change in ozone-related summertime mortality in the contiguous United States between 2000 and 2050 under a changing climate, and we compare two future emissions scenarios.

(an ML talk in the Biostatistics Speaker Series)

Mon 01/05/15, 12:15pm, Room W3030, School of Public Health**Optimal Inference After Model Selection***William Fithian, Stanford University*

**Abstract:** To perform inference after model selection, we propose controlling the selective type I error; i.e., the error rate of a test given that it was performed. By doing so, we recover long-run frequency properties among selected hypotheses analogous to those that apply in the classical (non-adaptive) context. Our proposal is closely related to data splitting and has a similar intuitive justification, but is more powerful. Exploiting the classical theory of Lehmann and Scheffe (1955), we derive most powerful unbiased selective tests and confidence intervals for inference in exponential family models after arbitrary selection procedures. For linear regression, we derive new selective z-tests that generalize recent proposals for inference after model selection and improve on their power, and new selective t-tests that do not require knowledge of the error variance. Based on joint work with Dennis Sun and Jonathan Taylor.

(an ML talk in the Biostatistics Speaker Series)

Fri 12/19/14, 12:00am, Room W3008, School of Public Health**Methods for Quantifying Conflict Casualties in Syria***Rebecca C. Steorts, Carnegie Mellon University*

**Abstract:** Information about social entities is often spread across multiple large databases, each degraded by noise, and without unique identifiers shared across databases. Record linkage, the task of reconstructing the actual entities and their attributes, is essential to using big data and is challenging not only for inference but also for computation. In this talk, I motivate record linkage by the current conflict in Syria. It has been tremendously well documented; however, we still do not know how many people have been killed by conflict-related violence. We describe a novel approach towards estimating death counts in Syria and challenges that are unique to this database. We first introduce a novel approach to record linkage by discovering a bipartite graph, which links manifest records to a common set of latent entities. Our model quantifies the uncertainty in the inference and propagates this uncertainty into subsequent analyses. We then introduce computational speed-ups to avoid all-to-all record comparisons based upon locality-sensitive hashing from the computer science literature. Finally, we speak to the success and challenges of solving a problem that is at the forefront of national headlines and news.
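The idea of avoiding all-to-all record comparisons with locality-sensitive hashing can be sketched with MinHash signatures and banding. This is a generic illustration, not the method presented in the talk: the character 3-gram shingles, the use of MD5 as a seeded hash, the band size, the toy records, and the function names are all assumptions.

```python
import hashlib
from collections import defaultdict

def shingles(text, k=3):
    """Set of lowercase character k-grams of a record."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(items, n_hashes=8):
    """One minimum per seeded hash function; similar sets tend to share minima."""
    return tuple(
        min(hashlib.md5(f"{seed}:{item}".encode()).hexdigest() for item in items)
        for seed in range(n_hashes)
    )

def candidate_pairs(records, band_size=2):
    """Return index pairs whose signatures agree on at least one band."""
    buckets = defaultdict(set)
    for idx, rec in enumerate(records):
        sig = minhash_signature(shingles(rec))
        for start in range(0, len(sig), band_size):
            buckets[(start, sig[start:start + band_size])].add(idx)
    pairs = set()
    for members in buckets.values():
        ordered = sorted(members)
        pairs.update((a, b) for i, a in enumerate(ordered) for b in ordered[i + 1:])
    return pairs

# Made-up records: the first two differ only in letter case.
records = ["Alice Example 1980", "alice example 1980", "Bob Sample 1975"]
pairs = candidate_pairs(records)
```

Only the pairs returned by `candidate_pairs` would then be passed to a more expensive comparison, so records that never share a bucket are never compared directly.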

(an ML talk in the Biostatistics Speaker Series)

Thu 12/18/14, 09:00pm, 1st Floor Hackerman Hall**Data Mining Poster Presentations**

**Abstract:** The students of the Data Mining course 550.436 Fall 2014 will present their projects with posters on Thursday 12/18 9-10:30 and Friday 12/19 9-10:30 and 10:45-12:15 on the first floor of Hackerman Hall. You are welcome to come over and see their work. This course is an introduction to Machine Learning. Most students are either master's students or young graduate students or undergraduate students. For most of them, this is their first poster presentation. They will demonstrate the proper use and respective performances of various classification algorithms on the data of their choice.

(an ML talk in the AMS Speaker Series)

Thu 12/11/14, 10:30am, Malone 228**Mediation Analysis: Theory and Methods***Ilya Shpitser, University of Southampton*

**Abstract:** The goal of causal inference is the discovery of cause-effect relationships from observational data, using appropriate assumptions. Two innovations that proved key for this task are a formal representation of potential outcomes under a random treatment assignment (due to Neyman), and viewing cause-effect relationships via directed acyclic graphs (due to Wright). Using a modern synthesis of these two ideas, I consider the problem of mediation analysis, which decomposes an overall causal effect into component effects corresponding to particular causal pathways. Simple mediation problems involving direct and indirect effects and linear models were considered by Baron and Kenny in the 1980s, and a significant literature has been developed since. In this talk, I consider mediation analysis at its most general: I allow arbitrary models, the presence of hidden variables, multiple outcomes, longitudinal treatments, and effects along arbitrary sets of causal pathways. There are three distinct but related problems to solve: a representation problem (what sort of potential outcome does an effect along a set of paths correspond to), an identification problem (can a causal parameter of interest be expressed as a functional of observed data), and an estimation problem (what are good ways of estimating the resulting statistical parameter). I report a complete solution to the first two problems, and progress on the third. In particular, I show that for some parameters that arise in mediation settings a triply robust estimator exists, which relies on an outcome model, a mediator model, and a treatment model, and which remains consistent if any two of these three models are correct. Some of the reported results are joint work with Eric Tchetgen Tchetgen, Caleb Miles, Phyllis Kanki, and Seema Meloni.

**Bio:** Ilya Shpitser is a Lecturer in Statistics at the University of Southampton. Previously, he was a Research Associate at the Harvard School of Public Health, working in the causal inference group with James M. Robins, Tyler VanderWeele, and Eric Tchetgen Tchetgen. His dissertation work was done at UCLA under the supervision of Judea Pearl. The fundamental question driving his research is this: “what makes it possible (or impossible) to infer cause-effect relationships?” Ilya received his Ph.D. in Computer Science from UCLA in 2008. He then did a postdoctoral fellowship in the causal inference group at the Harvard School of Public Health until 2012.

(an ML talk in the CS Speaker Series)

Wed 12/03/14, 12:00pm, Hackerman Hall B27**Model-Based Tracking Using 2D and 3D Visual Information***Henrik I. Christensen, Georgia Tech*

**Abstract:** As robotic systems are moving from well controlled settings to unstructured environments, they are required to operate in dynamic and cluttered scenes. Finding an object, estimating its pose, and tracking the pose over time in these scenes are challenging problems. Although various approaches have tackled these problems, their scope of objects and robustness of their solutions are still limited. We focus on object perception using visual sensory information, which spans from the monocular camera to the recently appeared RGB-D sensor, and address four important challenges related to the topic of 6-DOF object pose estimation and tracking in unstructured environments. A large number of 3D object models are widely available in online object model databases, and these object models have significant prior information which includes geometric shapes and photometric appearance. We note that using both geometric and photometric attributes available from the models makes it possible to handle both textured and textureless objects. We present efforts to broaden the spectrum of objects by combining geometric and photometric features. Another challenge is how to dependably estimate and track the pose of an object in spite of clutter in the background. The difficulties of object perception mainly depend on the degree of clutter. The background clutter is likely to lead to false measurements, and the wrong measurements tend to result in inaccurate pose estimates. We present two multiple pose hypotheses frameworks: a particle filtering framework for tracking and a voting framework for pose estimation.

**Bio:** Dr. Henrik I. Christensen is the KUKA Chair of Robotics at the College of Computing, Georgia Institute of Technology. He is also the executive director of the Institute for Robotics and Intelligent Machines (IRIM). Dr. Christensen does research on systems integration, human-robot interaction, mapping and robot vision. The research is performed within the Cognitive Robotics Laboratory. He has published more than 300 contributions across AI, robotics and vision. His research has a strong emphasis on “real problems with real solutions”. A problem needs a theoretical model, implementation, evaluation, and translation to the real world. He is actively engaged in the setup and coordination of robotics research in the US (and worldwide). Dr. Christensen received the Engelberger Award 2011, the highest honor awarded by the robotics industry. He was also awarded the “Boeing Supplier of the Year 2011” with 3 other colleagues at Georgia Tech. Dr. Christensen is a fellow of the American Association for the Advancement of Science. He received an honorary doctorate in engineering from Aalborg University in 2014. He collaborates with institutions and industries across three continents.

(an ML talk in the LCSR Speaker Series)

Tue 11/25/14, 10:30am, Hodson 213**Multiscale Geometric Methods for Statistical Learning and Data in High-Dimensions***Mauro Maggioni, Duke University*

**Abstract:** We discuss a family of ideas, algorithms, and results for analyzing various new and classical problems in the analysis of high-dimensional data sets. These methods rely on the idea of performing suitable multiscale geometric decompositions of the data, and exploiting such a decomposition to perform a variety of tasks in signal processing and statistical learning. In particular, we discuss the problem of dictionary learning, where one is interested in constructing, given a training set of signals, a set of vectors (dictionary) such that the signals admit a sparse representation in terms of the dictionary vectors. We discuss a multiscale geometric construction of such dictionaries, its computational cost and online versions, and finite sample guarantees on its quality. We then generalize part of this construction to other tasks, such as learning an estimator for the probability measure generating the data, again with fast algorithms with finite sample guarantees, and learning certain types of stochastic dynamical systems in high dimensions. Applications to construction of multi-resolution dictionaries for images will be discussed.

(an ML talk in the AMS Speaker Series)

Thu 11/20/14, 01:30pm, Maryland 110**The Revival of Coordinate Descent Methods***Stephen Wright, University of Wisconsin-Madison*

**Abstract:** The approach of minimizing a function by successively fixing most of its variables and minimizing with respect to the others dates back many years, and has been applied in an enormous range of applications. Until recently, however, the approach did not command much respect among optimization researchers; only a few prominent individuals took it seriously enough to analyze its properties. Recent years have seen an explosion in applications, particularly in data analysis, which has driven a new wave of research into variants of coordinate descent and their convergence properties. Such aspects as randomization in the choice of variants to fix and relax, acceleration methods, extension to regularized objectives, and parallel implementation have commanded a good deal of attention during the past five years. In this lecture, I will survey these recent developments, then focus on recent work on asynchronous parallel implementations for multicore computers. An analysis of the properties of the latter algorithms shows that near-linear speedup can be expected, up to a number of processors that depends on the coupling between the variables. This talk covers joint work with Ji Liu and other colleagues.

**Bio:** Stephen J. Wright is a Professor of Computer Sciences at the University of Wisconsin-Madison. His research interests lie in computational optimization and its applications to many areas of science and engineering. Prior to joining UW-Madison in 2001, Wright was a Senior Computer Scientist at Argonne National Laboratory (1990-2001), and was a Professor of Computer Science at the University of Chicago (2000-2001). During 2007-2010, he served as chair of the Mathematical Optimization Society, and he served the maximum three terms as an elected member of the Board of Trustees of the Society for Industrial and Applied Mathematics (SIAM). He is a Fellow of SIAM. He also serves on the Science Advisory Board of the Institute for Pure and Applied Mathematics. Wright is the author or coauthor of widely used text / reference books in optimization including “Primal Dual Interior-Point Methods” (SIAM, 1997) and “Numerical Optimization” (2nd Edition, Springer, 2006, with J. Nocedal). He has also authored 100 refereed journal papers on optimization theory, algorithms, software, and applications, along with 60 refereed conference papers and book chapters. He is coauthor of widely used software for linear programming (PCx) and quadratic programming (OOQP) based on interior-point methods, and GPSR and SpaRSA for compressed sensing. His paper on SpaRSA won the Baker Paper Prize from IEEE for best paper in any archival publication of that society in the period 2009-2011. Prof. Wright is editor-in-chief of the SIAM Journal on Optimization, and is an associate editor of Mathematical Programming, Series A. He has previously served as editor-in-chief of Mathematical Programming, Series B, and section editor of SIAM Review.

(an ML talk in the AMS Speaker Series)

Tue 11/18/14, 12:00pm, Hackerman Hall B17**The Unreasonable Effectiveness of Deep Learning***Yann LeCun, Facebook*

**Abstract:** The emergence of large datasets, parallel computers, and new machine learning methods has enabled the deployment of highly-accurate computer perception systems and is opening the door to a wide deployment of AI systems. A key component in AI systems is a module, sometimes called a feature extractor, that turns raw inputs into suitable internal representations. But designing and building such a module requires a considerable amount of engineering effort and domain expertise. Deep Learning methods have provided a way to automatically learn good representations of data from labeled or unlabeled samples. Deep architectures are composed of successive stages in which data representations are increasingly global, abstract, and invariant to irrelevant transformations of the input. Deep learning enables end-to-end training of these architectures, from raw inputs to ultimate outputs. The convolutional network model (ConvNet) is a particular type of deep architecture somewhat inspired by biology, which consists of multiple stages of filter banks, interspersed with non-linear operators, and spatial pooling. ConvNets have become the record holder for a wide variety of benchmarks, including object detection, localization and recognition in images, semantic segmentation and labeling, face recognition, acoustic modeling for speech recognition, drug design, handwriting recognition, biological image segmentation, etc. The most recent systems deployed by Facebook, Google, NEC, IBM, Microsoft, Baidu, Yahoo and others for image understanding, speech recognition, and natural language processing use deep learning. Many of these systems use very large and very deep ConvNets with billions of connections, trained in supervised mode. But many new applications require the use of unsupervised feature learning. A number of such methods based on sparse auto-encoders will be presented.
Several applications will be shown through videos and live demos, including a category-level object recognition system that can be trained on the fly, a scene parsing system that can label every pixel in an image with the category of the object it belongs to, an object localization and detection system, and several natural language processing systems. Specialized hardware architectures that run these systems in real time will also be described.

**Bio:** Yann LeCun is Director of AI Research at Facebook, and Silver Professor of Data Science, Computer Science, Neural Science, and Electrical Engineering at New York University, affiliated with the NYU Center for Data Science, the Courant Institute of Mathematical Sciences, the Center for Neural Science, and the Electrical and Computer Engineering Department. He received the Electrical Engineer Diploma from Ecole Supérieure d’Ingénieurs en Electrotechnique et Electronique (ESIEE), Paris in 1983, and a PhD in Computer Science from Université Pierre et Marie Curie (Paris) in 1987. After a postdoc at the University of Toronto, he joined AT&T Bell Laboratories in Holmdel, NJ in 1988. He became head of the Image Processing Research Department at AT&T Labs-Research in 1996, and joined NYU as a professor in 2003, after a brief period as a Fellow of the NEC Research Institute in Princeton. From 2012 to 2014 he directed NYU’s initiative in data science and became the founding director of the NYU Center for Data Science. He was named Director of AI Research at Facebook in late 2013 and retains a part-time position on the NYU faculty. His current interests include AI, machine learning, computer perception, mobile robotics, and computational neuroscience. He has published over 180 technical papers and book chapters on these topics as well as on neural networks, handwriting recognition, image processing and compression, and on dedicated circuits and architectures for computer perception. The character recognition technology he developed at Bell Labs is used by several banks around the world to read checks and was reading between 10 and 20% of all the checks in the US in the early 2000s. His image compression technology, called DjVu, is used by hundreds of web sites and publishers and millions of users to access scanned documents on the Web.
Since the mid-1980s he has been working on deep learning methods, particularly the convolutional network model, which is the basis of many products and services deployed by companies such as Facebook, Google, Microsoft, Baidu, IBM, NEC, AT&T and others for image and video understanding, document recognition, human-computer interaction, and speech recognition. LeCun has been on the editorial board of IJCV, IEEE PAMI, and IEEE Trans. Neural Networks, was program chair of CVPR’06, and is chair of ICLR 2013 and 2014. He is on the science advisory board of the Institute for Pure and Applied Mathematics, and of the Neural Computation and Adaptive Perception Program of the Canadian Institute for Advanced Research. He has advised many large and small companies about machine learning technology, including several startups he co-founded. He is the lead faculty at NYU for the Moore-Sloan Data Science Environment, a $36M initiative in collaboration with UC Berkeley and University of Washington to develop data-driven methods in the sciences. He is the recipient of the 2014 IEEE Neural Network Pioneer Award.

(an ML talk in the CLSP Speaker Series)

Tue 11/11/14, 01:30pm, Clark 314**Bringing Structure to Network Analysis***Blair D. Sullivan, North Carolina State University*

**Abstract:** The ability to analyze and interpret information from graphs modeling relationships in large datasets has become increasingly important in recent years, with the rapid growth of available data in domains from neuroscience and genomics to social networking and commercial marketing. Unfortunately, traditional graph algorithms often fail to scale (as the problems analysts want solved are NP-hard), and ad hoc heuristics have dominated the landscape. In a land far away (from the data), theoretical computer scientists and mathematicians have been working for years on algorithms and approaches that promise to solve many NP-hard graph problems in polynomial time under certain structural assumptions. Unfortunately, direct application is typically infeasible (due to issues like hidden constants in the complexity and extreme sensitivity to noise in the data). This talk will give a (relatively gentle) introduction to some of these structural assumptions, their implications for fast algorithms, and why direct application fails. Since failure is often not an option, we also describe our recent work on bridging the gap by integrating ideas from parameterized complexity into real-world network analysis, including algorithmic advances, classification of random graph models using the sparse graph hierarchy, and empirical evaluation of network structure.

**Bio:** Dr. Blair D. Sullivan (http://www.csc.ncsu.edu/faculty/bdsullivan/) directs the Theory in Practice group in the Department of Computer Science at North Carolina State University, which focuses on transforming techniques from theoretical computer science into practical, scalable tools for graph analysis. Her research combines expertise in structural graph theory, efficient algorithms, and combinatorial scientific computing with interdisciplinary collaborations to address fundamental problems in network science. Dr. Sullivan’s recent work highlights the potential for structure-based graph algorithms in data-driven science such as characterizing sparsity of key network models, and novel open-source tree decomposition algorithms which are compatible with parallel architectures. Dr. Sullivan is a Moore Investigator in Data-Driven Discovery, National Consortium for Data Science Faculty Fellow and holds a joint faculty appointment in the Computer Science and Mathematics Division at Oak Ridge National Laboratory. Dr. Sullivan earned her PhD from Princeton University and her BS from the Georgia Institute of Technology.

(an ML talk in the CIS Speaker Series)

Mon 11/10/14, 12:15pm, Room W3008, School of Public Health**Bayesian Latent Factor Models Recover Gene Networks and Expression QTLs***Barbara Engelhardt, Princeton University*

**Abstract:** Latent factor models have been the recent focus of much attention in ‘big data’ applications because of their ability to quickly allow the user to explore the underlying data in a controlled and interpretable way. In genomics, latent factor models are commonly used to identify population substructure, identify gene clusters, and control noise in large data sets. In this talk I present a general framework for Bayesian latent factor models. I will motivate some of the structural extensions to these models that have been proposed by my group. I will illustrate the power and the promise of these models for a much broader class of problems in genomics through two specific applications to the Genotype-Tissue Expression (GTEx) data set. First, I will show how this class of statistical models can be used to identify gene co-expression networks that co-vary uniquely in one tissue type. Second, I will show how these models can be used to identify pleiotropic expression QTLs, or genetic variants that jointly regulate the transcription levels of multiple genes. I will close by describing other uses for these models on genomic applications.

(an ML talk in the Biostatistics Speaker Series)

Tue 10/28/14, 01:30pm, Clark 314**Feature Selection with Annealing for Big Data Learning***Adrian Barbu, Department of Statistics, Florida State University*

**Abstract:** Many computer vision and medical imaging problems are faced with learning from large-scale datasets, with millions of observations and features. This work presents a novel efficient learning scheme that tightens a sparsity constraint by gradually removing variables based on a criterion and a schedule. The attractive fact is that the problem size keeps dropping throughout the iterations, which makes it particularly suitable for big data learning. This approach applies generically to the optimization of any differentiable loss function, and finds applications in regression, classification and ranking. The resultant algorithms build variable screening into estimation, are extremely simple to implement, and have theoretical guarantees of convergence and selection consistency. In addition, one-dimensional piecewise linear response functions are used to account for nonlinearity and a second order prior is imposed on these functions to avoid overfitting. Experiments on real and synthetic data show that the proposed method compares very well with other state-of-the-art methods in regression, classification and ranking while being computationally very efficient and scalable.

**Bio:** Adrian Barbu received his BS degree from University of Bucharest, Romania, in 1995, a Ph.D. in Mathematics from Ohio State University in 2000 and a Ph.D. in Computer Science from UCLA in 2005. From 2005 to 2007 he was a research scientist and later project manager in Siemens Corporate Research, working in medical imaging. He received the 2011 Thomas A. Edison Patent Award with his co-authors for their work on Marginal Space Learning. In 2007 he joined the Statistics department at Florida State University, first as assistant professor, and since 2013 as associate professor. His research interests are in computer vision, machine learning and medical imaging.

(an ML talk in the CIS Speaker Series)

Thu 10/23/14, 01:30pm, Maryland 110**Generative Models for Image Analysis***Lo-Bin Chang, Johns Hopkins University*

**Abstract:** A probabilistic grammar for the grouping and labeling of parts and objects, when taken together with pose and part-dependent appearance models, constitutes a generative scene model and a Bayesian framework for image analysis. To the extent that the generative model generates features, as opposed to pixel intensities, the posterior distribution (i.e. the conditional distribution on part and object labels given the image) is based on incomplete information; feature vectors are generally insufficient to recover the original intensities. I will propose a way to learn pixel-level models for the appearances of parts. I will demonstrate the utility of the models with some experiments in image sampling and image classification.

(an ML talk in the AMS Speaker Series)

Tue 10/21/14, 10:30am, Hackerman B17**Complexity and Compositionality***Alan Yuille, University of California, Los Angeles*

**Abstract:** A fundamental problem of vision is how to deal with the astronomical complexity of images, scenes, and visual tasks. For example, considering the enormous input space of images and output space of objects, how can a human observer obtain a coarse interpretation of an image within less than 150 msec? And how can the observer, given more time, be able to parse the image into its components (objects, object parts, and scene structures) and reason about their relationships and actions? The same complexity problem arguably arises in most aspects of intelligence and addressing it is critical to understanding the brain and to designing artificial intelligence systems. This talk describes a research program which addresses this problem by using hierarchical compositional models which represent objects, and scene structures, in terms of elementary components which can be grouped together to form more complex structures, shared between different objects, and which are represented more abstractly in summary form. This program is illustrated by examples including: (i) low-level representations of images, (ii) segmentation and bottom-up attentional mechanisms, (iii) detecting and parsing objects, (iv) estimating the 3D shapes of objects and scene structures from single images. We briefly discuss ongoing work that relates these models to experimental studies of the brain, including psychophysics, electrophysiology, and fMRI.

**Bio:** Professor Yuille is the Director of the UCLA Center for Cognition, Vision, and Learning, as well as a Professor at the UCLA Department of Statistics, with courtesy appointments at the Departments of Psychology, Computer Science, and Psychiatry. He is affiliated with the UCLA Staglin Center for Cognitive Neuroscience, the NSF Center for Brains, Minds and Machines, and the NSF Expedition in Visual Cortex On Silicon. His undergraduate degree was in Mathematics and his PhD in Theoretical Physics, both at the University of Cambridge. He has held appointments at MIT, Harvard, the Smith-Kettlewell Eye Research Institute, and UCLA. His research interests include computer vision, cognitive science, neural network modeling and machine learning. He has over three hundred peer reviewed publications. He has won several awards including the Marr prize and the Helmholtz test of time award. He is a fellow of IEEE.

(an ML talk in the CS Speaker Series)

Mon 10/20/14, 12:15pm, Room W3008, School of Public Health**Recent Advances in Deep Learning: Learning Structured, Robust and Multimodal Models***Ruslan Salakhutdinov, University of Toronto*

**Abstract:** Building intelligent systems that are capable of extracting meaningful representations from high-dimensional data lies at the core of solving many Artificial Intelligence tasks, including visual object recognition, information retrieval, speech perception, and language understanding. In this talk I will first introduce a broad class of hierarchical deep learning models and show that they can learn useful hierarchical representations from large volumes of high-dimensional data with applications in information retrieval, object recognition, and speech perception. I will then describe a new class of more complex models that combine deep learning models with structured hierarchical Bayesian models and show how these models can learn a deep hierarchical structure for sharing knowledge across hundreds of visual categories. Finally, I will introduce deep models that are capable of extracting a unified representation that fuses together multiple data modalities as well as discuss models that can generate natural language descriptions of images. I will show that on several tasks, including modelling images and text, video and sound, these models significantly improve upon many of the existing techniques.

(an ML talk in the Biostatistics Speaker Series)

Tue 10/14/14, 01:30pm, Clark 314**Convex Biclustering***Eric Chi, Digital Signal Processing Group, Rice University*

**Abstract:** In the biclustering problem, we seek to simultaneously group observations and features. While biclustering has applications in a wide array of domains, ranging from text mining to collaborative filtering, the problem of identifying structure in high-dimensional genomic data motivates this work. In this context, biclustering enables us to identify subsets of genes that are co-expressed only within a subset of experimental conditions. We present a convex formulation of the biclustering problem that possesses a unique global minimizer and an iterative algorithm, COBRA, that is guaranteed to identify it. Our approach generates an entire solution path of possible biclusters as a single tuning parameter is varied. We also show how to reduce the problem of selecting this tuning parameter to solving a trivial modification of the convex biclustering problem. The key contributions of our work are its simplicity, interpretability, and algorithmic guarantees – features that arguably are lacking in the current alternative algorithms. We demonstrate the advantages of our approach, which include stably and reproducibly identifying biclusterings, on simulated and real microarray data.

**Bio:** Eric Chi received his Ph.D. in Statistics from Rice University. After completing his Ph.D. he worked as a postdoctoral researcher with Kenneth Lange in the Human Genetics department at UCLA. He is currently a postdoctoral researcher working with Richard Baraniuk in the DSP group at Rice University. His research interests are in numerical optimization and its application to machine learning approaches for exploring large and complicated modern data.

(an ML talk in the CIS Speaker Series)

Mon 10/13/14, 12:15pm, Room W3008, School of Public Health**Time-Varying Networks Estimation and Dynamic Model Selection***Annie Qu, University of Illinois at Urbana-Champaign*

**Abstract:** In many biomedical and social science studies, it is important to identify and predict the dynamic changes of associations among network data over time. We propose a varying-coefficient model to incorporate time-varying network data, and impose a piecewise-penalty function to capture local features of the network associations. The advantages of the proposed approach are that it is nonparametric and therefore flexible in modeling dynamic changes of association for network data problems, and capable of identifying the time regions when dynamic changes of associations occur. To achieve local sparsity of network estimation, we implement a group penalization strategy involving overlapping parameters among different groups. However, this imposes great challenges in the optimization process for handling large-dimensional network data observed at many time points. We develop a fast algorithm, based on the smoothing proximal gradient method, which is computationally efficient and accurate. We illustrate the proposed method through simulation studies and children’s attention deficit hyperactivity disorder fMRI data, and show that the proposed method and algorithm efficiently recover dynamic network changes over time. This is joint work with Xinxin Shu and Lan Xue.

(an ML talk in the Biostatistics Speaker Series)

Thu 10/09/14, 01:30pm, Maryland 110**Coordinate Descent Methods for Modern Optimization Problems***Rachael Tappenden, Johns Hopkins University*

**Abstract:** ‘Big data’ is a topical area of research, where many interesting optimization applications arise. Applications include compressed sensing, machine learning, truss topology design and matrix completion, to name a few. Optimization algorithms that work well on medium or large scale problems may not be suitable in this ‘big data’ context because of the time and memory required for their implementation. (Block) Coordinate Descent (CD) methods seem to fill this niche because they are simple, have low memory requirements, and, at any iteration, do not usually require access to the data in its entirety. In this talk I will discuss some recent work on randomized block coordinate descent methods, including the incorporation of partial second order information, the implementation of inexact updates, and the use of varying blocks.

(an ML talk in the AMS Speaker Series)

Tue 10/07/14, 12:00pm, Hackerman B17**Single-Channel Mixed Speech Recognition Using Deep Neural Networks***Dong Yu, Microsoft Research*

**Abstract:** While significant progress has been made in improving the noise robustness of speech recognition systems, recognizing speech in the presence of a competing talker remains one of the most challenging unsolved problems in the field. In this talk, I will present our first attempt at attacking this problem using deep neural networks (DNNs). Our approach adopted a multi-style training strategy using artificially mixed speech data. I will discuss the strengths and weaknesses of several different setups that we have investigated, including a WFST-based two-talker decoder to work with the trained DNNs. Experiments on the 2006 speech separation and recognition challenge task demonstrate that the proposed DNN-based system has remarkable robustness to the interference of a competing speaker. The best setup of our proposed systems achieves an overall WER of 18.8%, which improves upon the results obtained by the state-of-the-art IBM superhuman system by 2.8% absolute, with fewer assumptions.

**Bio:** Dr. Dong Yu is a principal researcher at the Microsoft speech and dialog research group. His research interests include speech processing, robust speech recognition, discriminative training, and machine learning. He has published over 130 papers in these areas and is the co-inventor of more than 50 granted/pending patents. His recent work on the context-dependent deep neural network hidden Markov model (CD-DNN-HMM) has been shaping the direction of research on large vocabulary speech recognition and was recognized by the IEEE SPS 2013 best paper award.

(an ML talk in the CLSP Speaker Series)

Tue 10/07/14, 10:30am, Hackerman Hall B17**From Sensation to Conception: Theoretical Perspectives on Multisensory Perception and Cross-Modal Transfer***Robert Jacobs, University of Rochester*

**Abstract:** If a person is trained to recognize or categorize objects or events using one sensory modality, the person can often recognize or categorize those same (or similar) objects and events via a novel modality, an instance of cross-modal transfer of knowledge. How is this accomplished? The Multisensory Hypothesis states that people extract the intrinsic, modality-independent properties of objects and events, and represent these properties in multisensory representations. These representations mediate the transfer of knowledge across modality-specific representations. In this talk, I’ll present three studies, using experimental and computational methodologies, of the Multisensory Hypothesis. The first study examines visual-haptic transfer of object shape knowledge, the second study examines visual-auditory transfer of sequence category knowledge, and the final study examines a novel latent variable model of multisensory perception.

**Bio:** For my undergraduate studies, I attended the University of Pennsylvania where I majored in Psychology. I spent the next two years working as a Research Assistant in a biomedical research laboratory at Rockefeller University. For graduate school, I attended the University of Massachusetts at Amherst where I earned a Ph.D. degree in Computer and Information Science (graduate advisor: Andrew Barto). I then served in two postdoc positions, one in the Department of Brain & Cognitive Sciences at the Massachusetts Institute of Technology (postdoc advisor: Michael Jordan), and the other in the Department of Psychology at Harvard University (postdoc advisor: Stephen Kosslyn). I’m currently a faculty member at the University of Rochester where my title is Professor of Brain & Cognitive Sciences, of Computer Science, and of the Center for Visual Science. I am also a member of the Center for Computation and the Brain.

(an ML talk in the CS Speaker Series)

Mon 10/06/14, 12:15pm, Room W3008, School of Public Health**Analytical Approaches for Characterizing the Diffusion of New Medical Technologies***Sharon-Lise T. Normand, Harvard Medical School & Harvard School of Public Health*

**Abstract:** The last two decades have been characterized by dramatic changes in the use of atypical antipsychotic drugs, most of which are prescribed by psychiatrists and financed by public programs. In this talk, approaches to summarizing antipsychotic drug prescribing behaviors for three new therapeutically-similar drugs using dispensing information for nearly 17,000 U.S. physicians between January 1, 1997 and December 31, 2007 are described. While logistic models are commonly used to study the diffusion path of a new technology, several features of prescription data complicate inferences. These include time-varying drug choice sets due to different launch dates, semi-continuous response data, and multivariate outcomes. We begin by examining time to first prescription of a new antipsychotic, estimating diffusion paths separately for each new drug using non-parametric approaches. Next, we estimate multivariate survival models to identify fixed physician characteristics related to adoption time. We then utilize all antipsychotic prescription data and estimate key parameters of the diffusion path for each physician individually. The physician-specific parameters are combined using Bayesian multivariate factor analysis approaches to provide a parsimonious representation of drug prescribing behaviors. This work is funded, in part, by grants U01-MH103018 and R01-MH093359, both from the National Institute of Mental Health.

(an ML talk in the Biostatistics Speaker Series)

Thu 10/02/14, 01:30pm, Maryland 110**The Game of 20 Questions with (1) Noisy Answers and (2) Multiple Targets: A Delight of Information Theory, Probability, Control, and Computer Vision***Bruno Jedynak, Johns Hopkins University*

**Abstract:** We will explore various instances of the game of 20 questions with special interest in the situations where (1) the responses are noisy and (2) there are multiple targets. We will discuss adaptive as well as non-adaptive policies. We will study performance and optimality for an information theoretic cost function. Applications in fast face detection, micro-surgical tool tracking, and human vision will be briefly presented.

(an ML talk in the AMS Speaker Series)

Thu 10/02/14, 10:30am, Hackerman B17**What are Boundedly Rational Mechanisms for Language and Active Perception?***Richard Lewis, University of Michigan*

**Abstract:** Characterizing human language processing as rational probabilistic inference has yielded a number of useful insights. For example, surprisal theory (Hale, Levy) represents an elegant formalization of incremental processing that has met with empirical success (and some challenges) in accounting for word-by-word reading times. A theoretical challenge now facing the field is integrating rational analyses with bounded computational/cognitive mechanisms, and with task-oriented perception and action. A standard approach to such challenges (Marr and others) is to posit (bounded) mechanisms/algorithms that approximate functions specified at a rational analysis level. I discuss an alternative approach, computational rationality, that incorporates the bounds themselves in the definition of rational problems of utility maximization. This approach naturally admits of two kinds of analyses: the derivation of control strategies (policies or programs) for bounded agents that are optimal in local task settings, and the identification of processing mechanisms that are optimal across a broad range of tasks. As an instance of the first kind of analysis, we consider the derivation of eye-movement strategies in a simple word reading task, given general assumptions about noisy lexical representations and oculomotor architecture. These analyses yield novel predictions of task and payoff effects on fixation durations that we have tested and confirmed in eye-tracking experiments. (The model can be seen as a kind of ideal-observer/actor model, and naturally extends to account for distractor-ratio and pop-out effects in visual search). As an instance of the second kind of analysis, we consider properties of an optimal short-term memory system for sentence parsing, given general assumptions about noisy representations of linguistic features. 
Such a system provides principled explanations of similarity-based interference slow-downs and certain speed-accuracy tradeoffs in sentence processing. I conclude by sketching steps required for an integrated theory that jointly derives task-driven parsing and eye-movement strategies constrained by noisy memory and perception.

**Bio:** Richard Lewis is a cognitive scientist at the University of Michigan, where he is Professor of Psychology and Linguistics. He received his PhD in Computer Science at Carnegie Mellon with Allen Newell, followed by a McDonnell Fellowship in Psychology at Princeton and a position as Assistant Professor of Computer Science at Ohio State. His research interests include sentence processing, eye-movements, short-term memory, cognitive architecture, reinforcement learning and intrinsic reward, and optimal control approaches to modeling human behavior. He was elected a Fellow of the Association for Psychological Science in 2010.

(an ML talk in the CS Speaker Series)

Tue 09/30/14, 01:30pm, Clark 314**Deeply-Supervised Nets***Zhuowen Tu, University of California, San Diego*

**Abstract:** We present a new deep learning method, deeply-supervised nets (DSN), which simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We focus our attention on three specific issues present in existing convolutional-neural-network-type (CNN-type) architectures: (1) lack of transparency in the effect that intermediate layers have on the overall classification; (2) impeded training effectiveness in the face of the “exploding” and “vanishing” gradients; (3) at-times-limited discrimination and robustness of learned features, especially in early network layers. To combat these issues, we introduce “companion” objective functions at each individual hidden layer; we also maintain an overall objective function at the output layer. We show how this strategy distinctly differs from earlier approaches involving layer-wise pre-training and we also analyze our algorithm using techniques extended from stochastic gradient methods. The advantages provided by our method are readily apparent in our experimental results on benchmark datasets, showing significant performance gains over existing methods; for example, we attain state-of-the-art performance on MNIST, CIFAR-10, CIFAR-100, and SVHN.

**Bio:** Zhuowen Tu is an assistant professor in the Department of Cognitive Science and is also affiliated with the Department of Computer Science and Engineering, University of California, San Diego. Before joining UCSD, he was a faculty member at UCLA. Between 2011 and 2013, he took a leave to work at Microsoft Research Asia. He received his Ph.D. from the Ohio State University. His research interests span computer vision, machine learning, neuroimaging, and medical informatics.

(an ML talk in the CIS Speaker Series)

Tue 09/30/14, 10:30am, Hackerman Hall B17**Synergies in Word Learning***Mark Johnson, Macquarie University*

**Abstract:** As far as we know, no single kind of cue carries sufficient information to enable a language to be successfully learnt, so some kind of cue integration seems essential. This talk uses computational models to study how a diverse range of information sources can be exploited in word learning. I describe a non-parametric Bayesian framework called Adaptor Grammars, which can express computational models that exploit information ranging from stress cues through to discourse and contextual cues for learning words. We use these models to compare two different approaches a learner could use to acquire a language. A staged learner learns different aspects of a language independently of each other, while a joint learner learns them simultaneously. A joint learner can take advantage of synergistic dependencies between linguistic components in ways that a staged learner cannot. By comparing “minimal pairs” of models we show that there are interactions between the non-local context, syllable structure and the lexicon that a joint learner could synergistically exploit. This suggests that it would be advantageous for a language learner to integrate different kinds of cues according to Bayesian principles. We end with a discussion of the broader implications of a non-parametric Bayesian approach, and survey other applications of these techniques.

**Bio:** Mark Johnson is a Professor of Language Science (CORE) in the Department of Computing at Macquarie University, and is Director of the Macquarie Centre for Language Sciences. He received a BSc (Hons) in 1979 from the University of Sydney, an MA in 1984 from the University of California, San Diego and a PhD in 1987 from Stanford University. He held a postdoctoral fellowship at MIT from 1987 until 1988, and has been a visiting researcher at the University of Stuttgart, the Xerox Research Centre in Grenoble, CSAIL at MIT and the Natural Language group at Microsoft Research. He has worked on a wide range of topics in computational linguistics, and is mainly known for his work on syntactic parsing and its applications to text and speech processing. Recently he has developed non-parametric Bayesian models of human language acquisition. He was President of the Association for Computational Linguistics in 2003 and will be President of ACL’s SIGDAT (the organisation that runs EMNLP) in 2015, and was a professor from 1989 until 2009 in the Departments of Cognitive and Linguistic Sciences and Computer Science at Brown University.

(an ML talk in the CS Speaker Series)

Thu 09/25/14, 01:30pm, Maryland 110**Predictive Monitoring and Analytics for Physiologic Signals***Douglas Lake, University of Virginia*

**Abstract:** The process of translating mathematical models of physiological dynamics to clinical utility can be a very difficult and lengthy undertaking. For example, the recent success of a clinical trial showing that heart rate monitoring of the risk of sepsis leads to reduced mortality in neonates culminated more than 10 years after the initial observation of the physiologic phenomenon. This presentation covers some lessons learned and other examples of current methods to develop predictive models from a statistical, systems engineering, and theoretical mathematical perspective. In particular, methods and models to detect atrial fibrillation (AF) developed on a database of 2500 24-hour Holter monitor recordings are discussed. Entropy rate estimates measured with sample entropy (SampEn) are an important feature of these models. A critical part of translating theoretical concepts like entropy rate is developing signal processing algorithms that are robust to the noise of “real world” data and can be interpreted correctly by clinicians. This is especially true for the problem of avoiding inappropriate shocks in an implantable cardioverter defibrillator (ICD) where decisions can be made based on as few as 10 consecutive RR (inter-beat) intervals measured in milliseconds. The development of accurate entropy estimates requires a solid theoretical mathematical framework that includes the fields of Renyi entropy and kernel density estimation. A final product of this approach is a new algorithm called the coefficient of sample entropy (COSEn) which has recently been shown to accurately discriminate 79% (19/24) of inappropriate shocks because of AF in a large ICD study at Johns Hopkins University.

(an ML talk in the AMS Speaker Series)
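
For readers unfamiliar with sample entropy, the following is a minimal sketch of the usual textbook definition, SampEn(m, r) = -ln(A/B), where B counts pairs of length-m templates within tolerance r and A counts the same pairs extended to length m + 1. The function name, the default parameters, the Chebyshev distance with a `<=` comparison, and the choice to return infinity when no matches exist are assumptions of this example. It is not the robust estimator or the COSEn algorithm described in the talk.

```python
import math


def sample_entropy(x, m=2, r=0.2):
    """Textbook-style SampEn(m, r) of a sequence x (illustrative sketch)."""
    n = len(x)

    def count_matches(length):
        # Count template pairs (i < j, so no self-matches) whose Chebyshev
        # distance is at most r. The same n - m start points are used for
        # both template lengths so the two counts are comparable.
        count = 0
        for i in range(n - m):
            for j in range(i + 1, n - m):
                if max(abs(x[i + k] - x[j + k]) for k in range(length)) <= r:
                    count += 1
        return count

    b = count_matches(m)      # matches of length m
    a = count_matches(m + 1)  # matches that persist at length m + 1
    if a == 0 or b == 0:
        return float("inf")   # assumption: report "undefined" as infinity
    return -math.log(a / b)
```

A perfectly regular series gives a value of zero, while short or irregular series often produce no matches at all, which hints at why estimation from as few as 10 intervals calls for more careful methods.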

Wed 09/24/14, 12:00pm, Malone 328**Correctness Protection via Differential Privacy***Aaron Roth, University of Pennsylvania*

**Abstract:** False discovery is a growing problem in scientific research. Despite sophisticated statistical techniques for controlling the false discovery rate and related statistics designed to protect against spurious discoveries, there is significant evidence that many published scientific papers contain incorrect conclusions. In this talk we consider the role that adaptivity has in this problem. A fundamental disconnect between the theorems that control false discovery rate and the practice of science is that the theorems assume a fixed collection of hypotheses to be tested, selected non-adaptively before the data is gathered, whereas science is by definition an adaptive process, in which data is shared and re-used, while hypotheses are generated after seeing the results of previous tests. We note that false discovery cannot be prevented when a substantial number of adaptive queries are made to the data, and data is used naively — i.e. when queries are answered exactly with their empirical estimates on a given finite data set. However we show that remarkably, there is a different way to evaluate statistical queries on a data set that allows even an adaptive analyst to make exponentially many queries to the data set, while guaranteeing that with high probability, all of the conclusions he draws generalize to the underlying distribution. This technique counter-intuitively involves actively perturbing the answers given to the data analyst, using techniques developed for privacy preservation — but in our application, the perturbations are added entirely to increase the utility of the data. Joint work with Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, and Omer Reingold.

(an ML talk in the CS Speaker Series)

Tue 09/23/14, 10:30am, Hackerman B17**Probabilistic Models of Human Language Comprehension, Production and Acquisition***Roger Levy, University of California, San Diego*

**Abstract:** Human language acquisition and use are central problems for the advancement of machine intelligence, and pose some of the deepest scientific challenges in accounting for the capabilities of the human mind. In this talk I describe several major advances we have recently made in this domain by combining leading ideas and techniques from computer science and cognitive science. Central to these advances is the use of generative probabilistic models over richly structured linguistic representations. In language comprehension, I describe how we have used these models to develop detailed theories of incremental parsing that unify the central problems of ambiguity resolution, prediction, and syntactic complexity, and that yield compelling quantitative fits to behavioral data from both controlled psycholinguistic experiments and reading of naturalistic text. I also describe noisy-channel models relating the accrual of uncertain perceptual input with sentence-level language comprehension that account for critical outstanding puzzles for previous theories, and that when combined with reinforcement learning yield state-of-the-art models of human eye movement control in reading. This work on comprehension sets the stage for a theory in language production of how speakers tend toward an optimal distribution of information content throughout their utterances, whose predictions we confirm in statistical analysis of a variety of types of optional function word omission. Finally, I conclude with examples of how we use nonparametric models to account for some of the most challenging problems in language acquisition, including how humans learn phonetic category inventories and acquire and rank phonological constraints.

**Bio:** Roger Levy is Associate Professor of Linguistics at the University of California, San Diego, where he directs the world’s first Computational Psycholinguistics Laboratory. He received his B.S. from the University of Arizona and his M.S. and Ph.D. from Stanford University. He was a UK ESRC Postdoctoral Fellow at the University of Edinburgh before his current appointment. His awards include an NSF CAREER grant, an Alfred P. Sloan Research Fellowship, and a Fellowship at the Center for Advanced Study in the Behavioral Sciences. Levy’s research program is devoted to theoretical and applied questions at the intersection of cognition and computation, focusing on human language processing and acquisition. Inherently, linguistic communication involves the resolution of uncertainty over a potentially unbounded set of possible signals and meanings. How can a fixed set of knowledge and resources be acquired and deployed to manage this uncertainty? To address these questions Levy uses a combination of computational modeling and psycholinguistic experimentation. This work furthers our foundational understanding of linguistic cognition, and helps lay the groundwork for future generations of intelligent machines that can communicate with humans via natural language.

(an ML talk in the CS Speaker Series)

Wed 09/17/14, 12:00pm, Malone 228**New Algorithms for Learning Incoherent and Overcomplete Dictionaries***Rong Ge, Microsoft Research New England*

**Abstract:** In sparse recovery we are given a matrix A (“the dictionary”) and a vector of the form AX where X is sparse, and the goal is to recover X. This is a central notion in signal processing, statistics and machine learning. But in applications such as sparse coding, the dictionary A is unknown and has to be learned from random examples of the form Y = AX where X is drawn from an appropriate distribution; this is the dictionary learning problem. In most settings, A is overcomplete: it has more columns than rows. This talk presents a polynomial-time algorithm for learning overcomplete dictionaries. Our algorithm applies to incoherent dictionaries, which have been a central object of study since they were introduced in seminal work of Donoho and Huo.

**Bio:** Rong Ge is currently a post-doc at Microsoft Research, New England. He received his Ph.D. from Princeton University, advised by Prof. Sanjeev Arora. His main research interest is in applying algorithm design techniques from theoretical computer science to machine learning problems, with the hope of provable algorithms and better understanding of the machine learning models.

(an ML talk in the CS Speaker Series)
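
To make the sparse recovery setup concrete (known dictionary A, observation Y = AX, sparse X to be recovered), here is a minimal sketch using matching pursuit, a classical greedy heuristic. It is swapped in purely for illustration: it addresses only the easier known-dictionary problem and is not the dictionary learning algorithm presented in the talk. The function name `matching_pursuit` and the parameter `n_nonzero` are hypothetical.

```python
import math


def matching_pursuit(A, y, n_nonzero):
    """Greedily approximate y as a combination of at most n_nonzero picks
    from the columns of A (illustrative sketch, not the talk's algorithm)."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]
    sq_norms = [sum(v * v for v in col) for col in cols]
    x = [0.0] * n
    r = list(y)  # residual y - A x, starting from x = 0
    for _ in range(n_nonzero):
        # Pick the column with the largest normalized correlation to r.
        best_j, best_score = None, 0.0
        for j in range(n):
            if sq_norms[j] == 0.0:
                continue
            inner = sum(c * ri for c, ri in zip(cols[j], r))
            score = abs(inner) / math.sqrt(sq_norms[j])
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:
            break  # residual is orthogonal to every column; nothing to add
        coef = sum(c * ri for c, ri in zip(cols[best_j], r)) / sq_norms[best_j]
        x[best_j] += coef
        r = [ri - coef * c for ri, c in zip(r, cols[best_j])]
    return x
```

With the overcomplete dictionary `[[1, 0, 1], [0, 1, 1]]` (two rows, three columns) and `y = [2, 2]`, a single step selects the third column with coefficient 2. Greedy selection of this kind is only reliable when the columns are not too correlated with one another, which is the role incoherence plays in the abstract.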

Wed 09/03/14, 12:00pm, Sherwood Theater, Levering Hall**Machine Learning from Electronic Health Records***C. David Page, University of Wisconsin-Madison*

**Abstract:** Machine learning (ML) and healthcare have the potential to transform one another and, in doing so, to take society with them. This talk will give evidence for that claim by focusing on applications of machine learning to EHR data, to build predictive models for cardiac events and adverse drug events, and to build tools for breast cancer screening. The talk will discuss lessons from these applications for healthcare. It also will discuss lessons from these applications for machine learning, including twists on the widely-studied problems of learning causal models, learning from longitudinal and relational data, learning from weakly-labeled data, and the interplay of privacy and learning.

**Bio:** David Page received his Ph.D. in computer science from the University of Illinois at Urbana-Champaign in 1993. He was a post-doc in the Oxford University Computing Laboratory from 1993 to 1997 (visiting faculty ’95-’97), and an assistant professor in the University of Louisville’s Speed School of Engineering (’97-’99). David is now a professor at the University of Wisconsin-Madison, in the Dept. of Biostatistics and Medical Informatics within the School of Medicine and Public Health, with an appointment also in the Dept. of Computer Sciences. He directs the Cancer Informatics Shared Resource of the University of Wisconsin’s Carbone Cancer Center and is a member of the Genome Center of Wisconsin. He served on the NIH’s BioData Management and Analysis Study Section when it first became a standing study section, and has previously served on the scientific advisory boards for the Wisconsin Genomics Initiative and the Observational Medical Outcomes Partnership, focused on detecting adverse drug events. He serves on the editorial boards for Machine Learning and Data Mining and Knowledge Discovery. He currently holds NIH grants on Machine Learning for Adverse Drug Events, Secure Sharing of Clinical and Genetic Data, and RNAseq-based Prediction of Developmental Neural Toxicity, and he also participates in multiple grants applying machine learning to mammography and to mass spectrometry proteomics data.

Mon 08/04/14, 12:00pm, Arellano Theater**Discovery and Optimization of Dynamic Treatment Regimes through Reinforcement Learning***Joelle Pineau, McGill University*

**Abstract:** In this talk we will explore new algorithmic methods for automatically discovering and optimizing sequential treatments for chronic and life-threatening diseases. The first part will discuss the problem of designing adaptive multi-stage clinical trials. This can be cast as a multi-arm bandit problem, for which we present new optimization criteria designed to efficiently acquire data to learn personalized treatment rules. The second part will focus on the problem of using data collected in multi-stage sequential trials to automatically generate treatment strategies that are tailored to patient characteristics and time-dependent outcomes. In this case, we leverage models and algorithms from the reinforcement learning literature to perform comparison of different treatment sequences. The methods will be illustrated using our recent work on learning adaptive neurostimulation policies for the treatment of epilepsy. Brief examples will be drawn from some of our other projects, including developing dynamic treatment regimes for mental illness, diabetes and cancer.

Thu 04/24/14, 01:30pm, Shaffer 101**Random Time Changes in Quantitative Finance***Rafael Mendoza-Arriaga, UT Austin, McCombs School of Business*

**Abstract:** Subjecting a stochastic process to a random time change is a classical technique in stochastic analysis. In this talk we survey our recent work on using random time changes as a flexible model-building tool for designing stochastic processes that possess rich features required for empirical realism in financial modeling. These features include state-dependent and time-inhomogeneous jumps and stochastic volatility. Moreover, our modeling framework offers analytical and computational tractability, which are required for operational purposes. We sketch applications across credit, commodity, equity, power generation systems and insurance.

(an ML talk in the AMS Speaker Series)

Mon 04/21/14, 01:30pm, 314 Clark Hall**Graph Classification via Signal-Subgraphs: Applications in Statistical Connectomics***Joshua Vogelstein, Duke University*

**Abstract:** This talk considers the following “graph classification” question: given a collection of graphs and associated classes, how can one predict the class of a newly observed graph? To address this question we propose a statistical model for graph/class pairs. This model naturally leads to a set of estimators to identify the class-conditional signal, or “signal-subgraph,” defined as the collection of edges that are probabilistically different between the classes. The estimators admit classifiers which are asymptotically optimal and efficient, but differ by their assumption about the “coherency” of the signal-subgraph (coherency is the extent to which the signal-edges “stick together” around a common subset of vertices). Via numerical analysis, the best estimator is shown to be not just a function of the coherency of the model, but also the number of training samples. These estimators are employed to address a contemporary neuroscience question: can we classify “connectomes” (brain-graphs) according to sex? The answer is yes, and significantly better than all benchmark algorithms considered. Synthetic data analysis demonstrates that even when the model is correct, given the relatively small number of training samples, the estimated signal-subgraph should be taken with a grain of salt. We conclude by discussing scalability considerations and several extensions.

**Bio:** My BS is from the Dept. of Biomedical Engineering at Washington University in St. Louis; my MS and PhD are from the Dept’s of Applied Mathematics and Statistics (AMS) and Neuroscience, respectively, both at Johns Hopkins. Since graduating, I completed a postdoc with Carey Priebe, held an Assistant Research Scientist appointment at AMS, and then moved to Duke University as a Senior Research Scientist jointly in the Dept’s of Statistical Science and Mathematics. I hold several additional appointments and am a member of a variety of centers, including the Institute for Data Intensive Engineering and Sciences. I’ve won some awards, served on some panels, consulted for some companies, and presented my work at a wide variety of venues, including some of the top machine learning, computer vision, signal processing, statistics, mathematics, computer science, general science, translational science, and neuroscience peer-reviewed conferences and journals. My primary research interests lie in discovering patterns and relationships between brain data (typically brain images) and mental data (e.g., cognitive or psychiatric states). Because these data sets are often large and complex, extant statistical theory and methods often yield inadequate inferences. We therefore develop novel computational statistical theory and methods, designed to address specific scientific questions, and yet of more general interest. Much of this work focuses on the statistical analysis of populations of graph-valued data in neuroscience. My code is all open source on GitHub, and my data are all open data at the Open Connectome Project, which I co-founded.

Mon 04/21/14, 12:15pm, SPH, Room W4030**PCA, SVD, and PVD: Regularization, Applications, Asymptotics***Dr. Haipeng Shen, University of North Carolina at Chapel Hill*

**Abstract:** Principal component analysis (PCA) is a ubiquitous technique for dimension reduction of multivariate data. Regularization of PCA becomes essential due to high dimensionality, for example, in techniques such as functional PCA and sparse PCA. Maximizing variance of a standardized linear combination of variables is the standard textbook treatment of PCA. A more general perspective of PCA is through fitting low rank approximations to the data matrix, i.e. singular value decomposition (SVD). I shall take this alternative perspective, introduce a general regularization framework for PCA, and review several recent works about PCA regularization, through applications in genetics and neuroimaging. If time permits, I shall conclude with some discussion about a general asymptotic framework for studying consistency properties of PCA, and some ongoing work about population value decomposition (PVD).

(an ML talk in the Biostatistics Speaker Series)

Mon 04/14/14, 12:00pm, SPH, Room W4030**Scans of Poisson Random Fields and Detection of Genome Variation***Nancy Zhang, University of Pennsylvania*

**Abstract:** Structural variation, which includes deletion, insertion, and inversion of stretches of DNA, comprises an important class of genome variation in the human population, and is implicated in many diseases. High throughput paired end short read sequencing allows for genome-wide detection of a wide spectrum of structural variation. We develop a general model for this data, based on a Poisson random field, under which signals that are characteristic for each type of structural change can be modeled using a likelihood based framework. Statistics derived from the model integrate information from coverage, insert length, and other aspects of the data, and thus have improved sensitivity over methods that only utilize any single feature. The Poisson random field model can potentially generalize to other types of sequencing data. We also describe how to control the false discovery rate for scan statistics of Poisson random fields, and illustrate our methods on 1000 genomes data. This is joint work with David Siegmund and Benjamin Yakir.

(an ML talk in the Biostatistics Speaker Series)

Wed 04/09/14, 10:00am, Clark Hall 314**Learning Hierarchical and Compositional Models and Fast Inference Algorithms for Object Detection and Tracking***Matt Tianfu Wu, University of California, Los Angeles*

**Abstract:** Modern technological advances produce data at breathtaking scales and complexities such as the images and videos on the web. Such big data require highly expressive models for their representation, understanding and prediction. To fit such models to the big data, it is essential to develop practical learning methods and fast inferential algorithms. In this talk, I will mainly focus on the statistical inference framework in computer vision which learns to compute faster, and briefly overview the methods of learning hierarchical and compositional models for object detection and tracking. Accuracy performance and computational efficiency are often conflicting objectives, especially in big data applications. As the model complexity increases and the training/testing datasets become more massive, the computation can be either theoretically intractable or practically prohibitive. Unlike the algorithmic design in traditional paradigms of speeding up inferential computing, I developed a new ensemble-targeting and goal-guided framework of learning the computational decision policies which addresses two fundamental challenges in inferential computing: (i) Factorizing the computation of inference in a principled way such that they can be scheduled to obtain the optimal or near-optimal interpretation of a given input with a minimal computation cost (e.g., using early rejection/acceptance); and (ii) Balancing the loss of misclassification (i.e., the penalties C_FN and C_FP for a false negative and a false positive respectively) and the cost of computation (controlled by a parameter \lambda) in the scheduling.

**Bio:** Matt Tianfu Wu is currently a postdoctoral researcher in the Center for Vision, Cognition, Learning and Art (VCLA) at the University of California, Los Angeles (UCLA). He received a Ph.D. in Statistics from UCLA in 2011 under the supervision of Prof. Song-Chun Zhu. His research has been focused on the three aspects of statistics – statistical modeling, inference and learning – and computer vision: (i) Statistical learning of large scale and highly expressive hierarchical and compositional models from visual big data (images and videos). (ii) Statistical inference by learning near-optimal cost-sensitive decision policies that are similar in spirit to Wald’s sequential tests and Geman’s active testing framework. (iii) Statistical theory of performance guaranteed learning algorithm (two-round EM) and inference procedure (multi-armed bandit framework of optimally scheduled inferential computing).

(an ML talk in the CIS Speaker Series)

Tue 04/08/14, 12:15pm, 314 Clark Hall**Hierarchical Bayesian Methods for Multiple Outcomes in Network Meta-Analysis***Hwanhee Hong, University of Minnesota*

**Abstract:** Biomedical decision makers confronted with questions about the comparative effectiveness and safety of interventions often wish to combine all sources of data. Network meta-analysis (NMA) is a meta-analytic statistical technique that extends traditional meta-analysis of two treatments to simultaneously incorporate the findings from several studies on multiple treatments. In the NMA data framework, since few head-to-head comparisons are available, we must combine indirect and direct evidence. Many randomized clinical trials report multiple, possibly correlated outcomes for each treatment. Moreover, aggregated-level NMA data are typically sparse and researchers often choose study arms based on previous trials, further complicating identification of the best treatment. In this talk, we introduce novel Bayesian approaches for multiple outcomes simultaneously, rather than in separate NMA analyses. We do this by incorporating partially observed data and its correlation structure between outcomes through contrast- and arm-based parameterizations that consider any unobserved treatment arms as missing data to be imputed. Furthermore, availability of individual patient-level data (IPD) broadens the scope of NMA, and enables us to incorporate patient-level information into the analysis. As such, we propose a Bayesian IPD NMA modeling framework for bivariate continuous outcomes. We illustrate this approach using diabetes treatment, and show its practical implications. Finally, we close with a brief description of areas for future research.

(an ML talk in the CIS Speaker Series)

Tue 04/08/14, 12:00pm, Hackerman B17**Scalable Topic Models and Applications to Machine Translation***Ke Zhai, University of Maryland*

**Abstract:** Topic models are powerful tools for statistical analysis in text processing. Despite their success, application to large datasets is hampered by scaling inference to large parameter spaces. In this talk, we describe two ways to speed up topic models: parallelization and streaming. We propose a scalable and flexible implementation using variational inference on MapReduce. We further demonstrate two extensions of this model: using informed priors to incorporate word correlations, and extracting topics from a multilingual corpus. An alternative approach to achieve scalability is streaming, where the algorithm sees a small part of the data at a time and updates the model gradually. Although many streaming algorithms have been proposed for topic models, they all overlook a fundamental but challenging problem: the vocabulary is constantly evolving over time. We propose an online topic model with infinite vocabulary, which addresses this missing piece, and show that our algorithm is able to discover new words and refine topics on the fly. In addition, we also examine how topic models are helpful in acquiring domain knowledge and improving machine translation.

**Bio:** Ke Zhai is a PhD candidate in the Department of Computer Science, University of Maryland, College Park, working with Prof. Jordan Boyd-Graber. He is expected to receive his PhD degree in Fall 2014. He works in the area of Machine Learning and Natural Language Processing, with an additional focus on scalability and cloud computing. He has also worked on several research projects applying probabilistic Bayesian models in the areas of image processing and dialogue modelling. He has open-sourced several libraries, including Mr. LDA, which is a package for large-scale topic modeling and has been adopted in research and industry.

(an ML talk in the CLSP Speaker Series)

Mon 04/07/14, 10:00am, 314 Clark Hall**Representation, Inference and Optimization Over Emerging Visual Data: From Simple to Richer Semantics***Ruonan Li, Harvard University*

**Abstract:** What does a decade of unprecedented amounts of images and videos mean to computer vision? For inferring simple semantics such as object and activity categories, two predominant challenges are the lack of annotation combined with the effect of data-selection bias, and the increased variations among the instances of the same semantic concept. Beyond simple semantics, richer semantics are emerging, such as social interactions and social relationships, to which current machines remain quite blind. I will present three research vignettes that account for these challenges and opportunities. The first one introduces one of the first unsupervised domain adaptation mechanisms to build recognition systems for new domains. The second one develops a global optimization framework on a Riemannian manifold, and I will show its effect on removing the spatio-temporal mismatch between two videos of the same type of activity. The last one extracts social interactions from imagery using a unified non-parametric representation, and I will show its potential in enabling social network estimation from visual data.

**Bio:** Ruonan Li received the B.Eng. and M. S. degrees with honors in electrical engineering from Tsinghua University in 2004 and 2006 respectively. He received the Ph.D. degree in electrical engineering from the University of Maryland in 2011. He then joined the School of Engineering and Applied Sciences, Harvard University as a Postdoctoral Fellow, and was appointed Research Associate in 2012. His research is focused on understanding deep semantics in large-scale image data that emerges from and evolves with new acquisition and sharing modalities. His work is motivated by applications in object and behavior recognition; video analysis and spatio-temporal modeling; semi-supervised and unsupervised learning; and social signal processing.

(an ML talk in the CIS Speaker Series)

Tue 04/01/14, 01:00pm, 314 Clark Hall**Object Recognition and (Dynamic) Scene Analysis***Dr. Larry Davis, Institute for Advanced Computer Studies, University of Maryland, College Park*

**Abstract:** Data driven approaches to object recognition and scene analysis involve acquisition and annotation of training data to construct models for both individual object appearances and contextual information, and the design of an inference method that can apply these models to identify and describe objects and relationships in images. This talk will address a number of questions involved in the design of such scene analysis systems: 1. How can we obtain the appearance and contextual models at a reasonable cost? Challenges here are the high cost of data acquisition and annotation, and the long tail characteristics of visual data. I will first describe an approach for model acquisition using only weakly labeled images, and then describe an active learning approach that builds on this. Finally, I will discuss recent work on zero-shot learning for a video classification task. 2. Typical scene analysis systems operate on a representation of an image as a collection of segments. Purely bottom-up image segmentation algorithms generally do not produce segments that closely correspond to objects, and so researchers have turned to more data driven approaches to construct the set of segments over which inference is conducted. I will describe a few of the methods we use to construct such segmentations. 3. The use of these segmentation proposal methods results in a collection of segments that typically overlap and/or form a refinement hierarchy. I will discuss a scene analysis system that searches over such a collection of segments to find a best overall labeling of an image, and briefly discuss a recent extension to labeling sets of images. 4. Finally, if time permits I will discuss an approach to incremental inference for image and video analysis driven by first order logic models of context.

**Bio:** Larry S. Davis received his B.A. from Colgate University in 1970 and his M. S. and Ph. D. in Computer Science from the University of Maryland in 1974 and 1976 respectively. From 1977-1981 he was an Assistant Professor in the Department of Computer Science at the University of Texas, Austin. He returned to the University of Maryland as an Associate Professor in 1981. From 1985-1994 he was the Director of the University of Maryland Institute for Advanced Computer Studies. He was Chair of the Department of Computer Science from 1999-2012. He is currently a Professor in the Institute and the Computer Science Department, as well as Director of the Center for Automation Research. He was named a Fellow of the IEEE in 1997 and of the ACM in 2013. Prof. Davis is known for his research in computer vision and high performance computing. He has published over 100 papers in journals and 200 conference papers and has supervised over 35 Ph. D. students. During the past ten years his research has focused on visual surveillance and general video analysis.

(an ML talk in the CIS Speaker Series)

Mon 03/24/14, 12:15pm, SPH, Room W4030**Functions, Covariances and Learning Foreign Languages***Dr. John Aston, University of Cambridge, UK*

**Abstract:** Functional Data Analysis (FDA) is an area of statistics concerned with analysing statistical objects which are curves or surfaces. This makes FDA particularly applicable in phonetics, the branch of linguistics concerned with speech, in that each sound or phoneme which makes up a syllable can be characterised as a time-frequency spectrogram surface. One question of particular significance in phonetics is how languages are related, and concepts as simple as how close two languages are have proved difficult to quantify. Recent work on FDA and phonetics in Mandarin and Qiang (Chinese dialects) has suggested that the use of covariance functions might facilitate the finding of new measures of closeness of languages. However, working with covariance functions immediately raises the issue that the functions lie on a manifold (of positive definite operators) rather than in a standard Euclidean space. Here, a new metric for covariance functions is introduced which allows valid inference for covariance functions, but which also possesses good properties when examining extrapolations, something that can be used to determine phonetic relationships. The theory and methodology of the new distance metrics for covariance functions will be illustrated using some of the Romance languages (the languages which have Latin as a root). [Joint work with Davide Pigoli (Milan), Pantelis Hadjipantelis (Warwick), Ian Dryden (Nottingham), and Piercesare Secchi (Milan)].

(an ML talk in the Biostatistics Speaker Series)

Tue 03/11/14, 12:00pm, Hackerman B17**Deep Learning of Generative Models***Yoshua Bengio, University of Montreal*

**Abstract:** Deep learning has been highly successful in recent years mostly thanks to progress in algorithms for training deep but supervised feedforward neural networks. These deep neural networks have become the state-of-the-art in speech recognition, object recognition, and object detection. What’s next for deep learning? We argue that progress in unsupervised deep learning algorithms is a key to progress on a number of fronts, such as better generalization to new classes from only one or few labeled examples, domain adaptation, transfer learning, etc. It would also be key to extend the output spaces from simple classification tasks to structured outputs, e.g., for machine translation or speech synthesis. This talk discusses some of the challenges involved in unsupervised learning of models with latent variables for AI tasks, in particular the difficulties due to the partition function, mixing between modes, and the potentially huge number of real or spurious modes. The manifold view of deep learning and experimental results suggest that many of these challenges could be greatly reduced by performing the hard work in the learned higher-level more abstract spaces discovered by deep learning, rather than in the space of visible variables. Further gains are sought by exploiting the idea behind GSNs (Generative Stochastic Networks) and denoising auto-encoders: learning a Markov chain operator that generates the desired distribution rather than parametrizing that distribution directly. The advantage is that each step of the Markov chain transition involves fewer modes, i.e., a partition function that can be more easily approximated.

**Bio:** Yoshua Bengio (CS PhD, McGill University, 1991) was a post-doc with Michael Jordan at MIT and worked at AT&T Bell Labs before becoming a professor at U. Montreal. He wrote two books and around 200 papers, the most cited being in the areas of deep learning, recurrent neural networks, probabilistic learning, NLP and manifold learning. Among the most cited Canadian computer scientists and one of the scientists responsible for reviving neural networks research with deep learning in 2006, he sat on editorial boards of top ML journals and of the NIPS foundation, holds a Canada Research Chair and an NSERC chair, is a Fellow of CIFAR and has been program/general chair for NIPS. He is driven by his quest for AI through machine learning, involving fundamental questions on learning of deep representations, the geometry of generalization in high-dimension, manifold learning, biologically inspired learning, and challenging applications of ML. In February 2014, Google Scholar finds almost 16000 citations to his work, yielding an h-index of 55.

(an ML talk in the CLSP Speaker Series)

Thu 03/06/14, 01:30pm, Shaffer 101**Network Histograms and Universality of Blockmodel Approximation***Sofia Olhede, University College London*

**Abstract:** Networks are fast becoming part of the modern statistical landscape. Yet we lack a full understanding of their large-sample properties in all but the simplest settings, hindering the development of models and estimation methods that admit theoretical performance guarantees. A network histogram is obtained by fitting a stochastic blockmodel to a single observation of a network dataset. Blocks of edges play the role of histogram bins, and community sizes that of histogram bandwidths or bin sizes. Just as standard histograms allow for varying bandwidths, different blockmodel estimates can all be considered valid representations of an underlying probability model, subject to bandwidth constraints. We show that under these constraints, the mean integrated square error of the network histogram tends to zero as the network grows large, and we provide methods for optimal bandwidth selection, thus making the blockmodel a universal representation. With this insight, we discuss the interpretation of network communities in light of the fact that many different community assignments can all give an equally valid representation of the network. To demonstrate the fidelity-versus-interpretability tradeoff inherent in considering different numbers and sizes of communities, we show an example of detecting and describing new network community microstructure in political weblog data. This is joint work with Patrick Wolfe at UCL.

(an ML talk in the AMS Speaker Series)

Wed 03/05/14, 02:00pm, Clark Hall 314**Sparse Modeling for High-Dimensional Multi-Manifold Data Analysis***Ehsan Elhamifar, University of California, Berkeley*

**Abstract:** One of the most fundamental challenges facing scientists and engineers across different fields, such as signal/image processing, computer vision, robotics and bioinformatics, is the large amounts of high-dimensional data that need to be analyzed and understood. In this talk, I present efficient and theoretically guaranteed algorithms, based on the sparse representation theory, for the analysis of high-dimensional datasets by exploiting their underlying low-dimensional structures. I talk about algorithms for the two fundamental problems of clustering and subset selection in unions of subspaces and discuss the robustness of the algorithms to data nuisances. I show that these tools effectively advance the state-of-the-art data analysis in a wide range of important real-world problems, such as segmentation of motions in videos, clustering of images of objects, active learning and identification of hybrid dynamical systems.

**Bio:** Ehsan Elhamifar is a postdoctoral scholar in the department of Electrical Engineering and Computer Sciences at the University of California, Berkeley. He obtained his PhD in Electrical and Computer Engineering from the Johns Hopkins University. Ehsan is broadly interested in developing provably correct and efficient data analysis algorithms that can address the challenges of complex and large-scale high-dimensional datasets. Specifically, he focuses on the intrinsic low-dimensionality of the data and uses tools from convex analysis, sparse representation and compressive sensing to develop such algorithms. Ehsan obtained MSE and MS degrees in Applied Mathematics and Statistics and Electrical Engineering, respectively, from the Johns Hopkins University and Sharif University of Technology in Iran. Before that, he earned his BS with Honors in Biomedical Engineering from Amirkabir University of Technology, Iran.

(an ML talk in the CIS Speaker Series)

Tue 03/04/14, 01:30pm, Clark Hall 314**A Convex-Programming Framework for Super-Resolution***Carlos Fernandez-Granda, Stanford University*

**Abstract:** We propose a general framework to perform statistical estimation from low-resolution data, a crucial challenge in applications ranging from microscopy, astronomy and medical imaging to geophysics, signal processing and spectroscopy. First, we show that solving a simple convex program allows one to super-resolve a superposition of point sources from bandlimited measurements with infinite precision. This holds as long as the sources are separated by a distance related to the cut-off frequency of the data. The result extends to higher dimensions and to the super-resolution of piecewise-smooth functions. Then, we provide theoretical guarantees that establish the robustness of our methods to noise in a non-asymptotic regime. Finally, we illustrate the flexibility of the framework by discussing extensions to the demixing of sines and spikes and to super-resolution from multiple measurements.

**Bio:** Carlos Fernandez-Granda is a PhD student in Electrical Engineering at Stanford University. Previously, he received an M.Sc. degree from Ecole Normale Superieure de Cachan and engineering degrees from Universidad Politécnica de Madrid and Ecole des Mines in Paris. His research interests are at the intersection of optimization, high-dimensional statistics and harmonic analysis, with emphasis on applications to computer vision, medical imaging and big data.

(an ML talk in the CIS Speaker Series)

Fri 02/28/14, 10:00am, Clark Hall 314**The fshape Theoretical and Numerical Framework for the Analysis of Populations of Textured Manifolds***Nicolas Charon, University of Copenhagen*

**Abstract:** In this talk, I will present some very recent advances related to the problem of extending shape analysis frameworks to populations of textured manifolds, i.e., geometrical shapes that carry additional signal information. Such datasets are more and more common in the field of biomedical imaging. However, unlike classical images, the variability of the signal supports across individuals, in addition to the photometric variations, makes their analysis even more difficult. This work is meant to propose one possible general mathematical and numerical framework to generalize the previous large deformation settings on images and submanifolds. It consists of a proper definition of a notion of functional shape (or fshape) spaces as well as fshape metamorphoses to model geometrico-functional transformations between fshapes. I will also explain how to build appropriate data attachment terms on such spaces based on the extended concept of varifolds from geometric measure theory. All this together enables a well-posed formulation of the forward statistical atlas estimation problem for populations of fshapes. I will present algorithms to perform the estimation numerically and show a few preliminary results obtained on synthetic and real datasets.

**Bio:** Nicolas Charon obtained his PhD in Applied Mathematics from Ecole Normale Supérieure de Cachan in 2013. His research topics include shape analysis, image processing and geometric measure theory with applications to computational anatomy. Since January 2014, he has been a post-doctoral researcher in the Department of Computer Science at the University of Copenhagen.

(an ML talk in the CIS Speaker Series)

Thu 02/27/14, 10:30am, Hackerman B17**Predicting Viral Infection from High-Dimensional Biomarker Trajectories***Minhua Chen, University of Chicago*

**Abstract:** There is often interest in predicting an individual’s latent health status based on high-dimensional biomarkers that vary over time. Motivated by time-course gene expression array data that we have collected in two influenza challenge studies, we develop a novel time-aligned Bayesian dynamic factor analysis methodology. The time course trajectories in the gene expressions are related to a relatively low-dimensional vector of latent factors, which vary dynamically starting at the latent initiation time of infection.

**Bio:** Minhua Chen received his Ph.D. degree from Duke University in May 2012, working with Profs. Lawrence Carin and David Dunson on Bayesian and Information-Theoretic Learning of High Dimensional Data. Currently he is working on statistical machine learning problems at the University of Chicago in collaboration with Prof. John Lafferty. His research interests broadly span machine learning, signal processing and bioinformatics.

(an ML talk in the CS Speaker Series)

Mon 02/17/14, 12:15pm, Room W4030, School of Public Health**Estimation Over Multiple Undirected Graphs***Yunzhang Zhu, University of Minnesota*

**Abstract:** Graphical models are useful in analyzing complex systems involving a large number of interacting units. For example, in gene expression analysis, one key challenge is reconstruction of gene networks, describing gene-gene interactions. Observed attributes of genes, such as gene expressions, are used to reconstruct gene networks through graphical models. In this presentation, I will focus on estimation of multiple undirected graphs, motivated by network analysis under different experimental conditions, such as gene networks for disparate cancer subtypes. A method for pursuing two types of structures, clustering and sparseness, is proposed based on the penalized maximum likelihood. Theoretically, I will present a finite-sample error bound for reconstructing these two types of structures. This leads to consistent reconstruction of them simultaneously, permitting the number of unknown parameters to be exponential in the sample size, in addition to optimality of the proposed estimator as if the true structures were given a priori. Computationally, a necessary and sufficient partition rule is derived, on which estimation of multiple large graphs can proceed with smaller disjoint subproblems. This divide-and-conquer strategy permits efficient computation. Finally, I will demonstrate the proposed method on real examples.

(an ML talk in the Biostatistics Speaker Series)

Tue 02/11/14, 01:00pm, Clark 314**Visualizing and Understanding Convolutional Networks***Rob Fergus, New York University*

**Abstract:** Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark (Krizhevsky et al.). However there is no clear understanding of why they perform so well, or how they might be improved. In this talk we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. We also perform an ablation study to discover the performance contribution from different model layers. This enables us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We also show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.

**Bio:** Rob Fergus is an Associate Professor of Computer Science at the Courant Institute of Mathematical Sciences, New York University. He is also a Research Scientist at Facebook, working in their AI Research Group. He received a Masters in Electrical Engineering with Prof. Pietro Perona at Caltech, before completing a PhD with Prof. Andrew Zisserman at the University of Oxford in 2005. Before coming to NYU, he spent two years as a post-doc in the Computer Science and Artificial Intelligence Lab (CSAIL) at MIT, working with Prof. William Freeman. He has received several awards including a CVPR best paper prize, a Sloan Fellowship & NSF Career award and the IEEE Longuet-Higgins prize.

(an ML talk in the CIS Speaker Series)

Mon 02/03/14, 12:15pm, SPH, 615 N. Wolfe St, Room W4030**Bayesian Compression in High Dimensional Regression***Rajarshi Guhaniyogi, Duke University*

**Abstract:** As an alternative to variable selection or shrinkage in massive dimensional regression, we propose a novel idea to compress massive dimensional predictors to a lower dimension. Careful studies are carried out with several choices of compression matrices to understand their relative advantages and trade-offs. Compressing predictors with random compression dramatically reduces storage and computational bottlenecks, yielding accurate prediction when the predictors can be projected onto a low dimensional linear subspace with minimal loss of information about the response. As opposed to existing Bayesian dimensionality reduction approaches, the exact posterior distribution conditional on the compressed data is available analytically. This results in speeding up computation by many orders of magnitude while also bypassing robustness issues due to convergence and mixing problems with MCMC. Model averaging is used to reduce sensitivity to the random compression matrix, while accommodating uncertainty in the subspace dimension. Designed compression matrices additionally facilitate accurate lower dimensional subspace estimation along with good predictive inference. Novel modeling techniques are implemented to scale up these methods in the presence of massive data or streaming data. Strong theoretical results are provided guaranteeing “optimal” convergence behavior for both these approaches. Practical performance relative to competitors is illustrated in extensive simulations and applications from genetic epidemiology and control problems of flying an F-16 aircraft.

(an ML talk in the Biostatistics Speaker Series)

Thu 01/30/14, 03:15pm, Gilman 400**Clearing the Jungle of Stochastic Optimization***Warren Powell, Princeton University*

**Abstract:** Mathematical programming has given us a widely used canonical framework for modeling deterministic optimization problems. However, if we introduce a random variable, the field fragments into a number of competing communities with names such as stochastic programming, dynamic programming, stochastic search, simulation optimization, and optimal control. Differences between these fields are magnified by notation and terminology, which hide subtle but more important differences in problem characteristics. Complicating matters further is a misunderstanding of basic terms such as dynamic program, policy, and state variable. While deterministic optimization problems are defined by solving for decisions, sequential stochastic optimization problems are characterized by finding functions known as policies. I will identify four fundamental classes of policies which unify the competing subcommunities into a common framework I call computational stochastic optimization. These ideas will be illustrated using applications drawn from energy, health, freight transportation, and vehicle navigation.

(an ML talk in the AMS Speaker Series)

Thu 01/30/14, 01:30pm, Shaffer 101**Energy and Uncertainty: From the laboratory to policy, unifying the fields of stochastic optimization***Warren Powell, Princeton University*

**Abstract:** Errors in forecasting wind and clouds, generator failures, market responses to price signals, unexpected outcomes in laboratory experiments, breakthroughs in storage technology, the heavy-tailed volatility of electricity prices, our shifting understanding of climate change, and the policies of state and federal governments. These are all examples of the different types of uncertainty that we have to deal with as we plan an energy future with the diminished footprint of fossil fuels. We have to understand these sources of uncertainty as we search for new materials, design and operate the power grid, work with energy markets, manage the energy demands of buildings and residences, and guide policy. We would like to guide these decisions using analytics, but we lack even a fundamental vocabulary for writing down these problems mathematically. Stochastic optimization is a jungle of mathematically arcane, computationally complex models and algorithms which have to be tuned for each application. In this talk, I will provide a brief overview of a mathematical framework that can be taught to any undergraduate in science and engineering. I will then unify a sprawling literature using four fundamental classes of policies. Finally, I will illustrate these using several applications in energy systems, closing with a description of SMART-ISO.

(an ML talk in the AMS Speaker Series)

Thu 11/21/13, 01:30pm, Whitehead 304**Functional Data Analysis of Large Neuroimaging Data***Hongtu Zhu, University of North Carolina*

**Abstract:** Motivated by recent work on studying massive imaging data in various neuroimaging studies, our group proposes several classes of spatial regression models including spatially varying coefficient models, spatial predictive Gaussian process models, tensor regression models, and Cox functional linear regression models for the joint analysis of large neuroimaging data and clinical and behavioral data. Our statistical models explicitly account for several stylized features of neuroimaging data: the presence of multiple piecewise smooth regions with unknown edges and jumps and substantial spatial correlations. We develop some fast estimation procedures to simultaneously estimate the varying coefficient functions and the spatial correlations. We systematically investigate the asymptotic properties (e.g., consistency and asymptotic normality) of the multiscale adaptive parameter estimates. Our Monte Carlo simulation and real data analysis have confirmed the excellent performance of our models in different applications.

(an ML talk in the AMS Speaker Series)

Tue 11/19/13, 12:00pm, Hackerman B17**Pursuit of Low-dimensional Structures in High-dimensional Data***Yi Ma, Microsoft*

**Abstract:** In this talk, we will discuss a new class of models and techniques that can effectively model and extract rich low-dimensional structures in high-dimensional data such as images and videos, despite nonlinear transformation, gross corruption, or severely compressed measurements. This work leverages recent advancements in convex optimization for recovering low-rank or sparse signals that provide both strong theoretical guarantees and efficient and scalable algorithms for solving such high-dimensional combinatorial problems. These results and tools actually generalize to a large family of low-complexity structures whose associated (convex) regularizers are decomposable. We illustrate how these new mathematical models and tools could bring disruptive changes to solutions to many challenging tasks in computer vision, image processing, and pattern recognition. We will also illustrate some emerging applications of these tools to other data types such as web documents, image tags, microarray data, audio/music analysis, and graphical models. This is joint work with John Wright of Columbia, Emmanuel Candes of Stanford, Zhouchen Lin of Peking University, and my students Zhengdong Zhang, Xiao Liang of Tsinghua University, Arvind Ganesh, Zihan Zhou, Kerui Min and Hossein Mobahi of UIUC.

**Bio:** Yi Ma has been a Principal Researcher and the Research Manager of the Visual Computing group at Microsoft Research Asia in Beijing since January 2009. Before that he was a professor at the Electrical & Computer Engineering Department of the University of Illinois at Urbana-Champaign. His main research interest is in computer vision, high-dimensional data analysis, and systems theory. He is the first author of the popular vision textbook “An Invitation to 3-D Vision,” published by Springer in 2003. Yi Ma received his Bachelor’s degree in Automation and Applied Mathematics from Tsinghua University (Beijing, China) in 1995, a Master of Science degree in EECS in 1997, a Master of Arts degree in Mathematics in 2000, and a PhD degree in EECS in 2000, all from the University of California at Berkeley. Yi Ma received the David Marr Best Paper Prize at the International Conference on Computer Vision 1999, the Longuet-Higgins Best Paper Prize at the European Conference on Computer Vision 2004, and the Sang Uk Lee Best Student Paper Award with his students at the Asian Conference on Computer Vision in 2009. He also received the CAREER Award from the National Science Foundation in 2004 and the Young Investigator Award from the Office of Naval Research in 2005. He was an associate editor of IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) from 2007 to 2011. He is currently an associate editor of the International Journal of Computer Vision (IJCV), the IMA journal on Information and Inference, SIAM Journal on Imaging Sciences, and IEEE Transactions on Information Theory. He has served as the chief guest editor for special issues for the Proceedings of IEEE and the IEEE Signal Processing Magazine. He will also serve as Program Chair for ICCV 2013 and General Chair for ICCV 2015. He is a Fellow of IEEE.

(an ML talk in the CLSP Speaker Series)

Fri 11/15/13, 11:00am, Weinberg Auditorium (410 North Broadway)**A Bayes Rule for Subgroup Reporting — Bayesian Adaptive Enrichment Designs***Peter Muller, University of Texas at Austin*

**Abstract:** We discuss Bayesian inference for subgroups in clinical trials. We start with a decision theoretic approach, based on a straightforward extension of a 0/c utility function and a probability model across all possible subgroup models. We show that the resulting rule is essentially determined by the odds of subgroup models relative to the overall null hypothesis M0 of no treatment effects and relative to the overall alternative M1 of a common treatment effect in the entire patient population. This greatly simplifies posterior inference. We then generalize the approach to allow for subgroups that are characterized by arbitrary interactions of covariates. The two key elements of the generalization are a flexible nonparametric Bayesian response function and a separate description of the subgroup report that is not linked to the parametrization of the response model. We discuss an application to an adaptive enrichment design for targeted therapy.

(an ML talk in the CS Speaker Series)

Thu 11/14/13, 01:30pm, Whitehead 304**Parameter Estimation for Systems with Binary Subsystems***James Spall, Johns Hopkins APL*

**Abstract:** Consider a stochastic system of multiple subsystems, each subsystem having binary (“0 or 1”) output. The full system may have general binary or non-binary (e.g., Gaussian) output. Such systems are widely encountered in practice, and include engineering systems for reliability, communications and sensor networks, the collection of patients in a clinical trial, and Internet-based control systems. This paper considers the identification of parameters for such systems for general structural relationships between the subsystems and the full system. Maximum likelihood estimation (MLE) is used to estimate the mean output for the full system and the “success” probabilities for the subsystems. We present formal conditions for the convergence of the MLEs to the true full system and subsystem values as well as results on the asymptotic distributions for the MLEs. The MLE approach is well suited to providing asymptotic or finite-sample confidence bounds through the use of Fisher information or bootstrap Monte Carlo-based sampling.

(an ML talk in the AMS Speaker Series)

Thu 11/14/13, 10:45am, Hackerman B17**Causal Inference from Uncertain Time Series Data***Samantha Kleinberg, Stevens Institute of Technology*

**Abstract:** One of the key problems we face with the accumulation of massive datasets (such as electronic health records and stock market data) is the transformation of data to actionable knowledge. In order to use the information gained from analyzing these data to intervene to, say, treat patients or create new fiscal policies, we need to know that the relationships we have inferred are causal. Further, we need to know the time over which the relationship takes place in order to know when to intervene. In this talk I discuss recent methods for finding causal relationships and their timing from uncertain data with minimal background knowledge and their applications to observational health data.

**Bio:** Samantha Kleinberg is an Assistant Professor of Computer Science at Stevens Institute of Technology. She received her PhD in Computer Science from New York University in 2010 and was a Computing Innovation Fellow at Columbia University in the Department of Biomedical Informatics from 2010 to 2012. Her research centers on developing methods for analyzing large-scale, complex, time-series data. In particular, her work develops methods for finding causes and automatically generating explanations for events, facilitating decision-making using massive datasets. She is the author of Causality, Probability, and Time (Cambridge University Press, 2012), and PI of an R01 from the National Library of Medicine.

(an ML talk in the CS Speaker Series)

Tue 11/12/13, 12:00pm, Hackerman B17**Bayesian Models for Social Interactions***Katherine Heller, Duke University*

**Abstract:** A fundamental part of understanding human behavior is understanding social interactions between people. We would like to be able to make better predictions about social behavior so that we can improve people’s social interactions or somehow make them more beneficial. This is very relevant in light of the fact that an increasing number of interactions are happening in online environments which we design, but is also useful for offline interactions such as structuring interactions in the workplace, or even being able to advise people about their individual health based on who they’ve come into contact with. I will focus on two recent projects. In the first we use nonparametric Bayesian methods to predict group structure in social networks from the social interactions of individuals over time, using actual events (emails, conversations, etc.) instead of declared relationships (e.g., Facebook friends). The time series of events is modeled using Hawkes processes, while relational grouping is done via the Infinite Relational Model. In the second, we use Graph-Coupled Hidden Markov Models to predict the spread of infection in a college dormitory. This is done by looking at a social network of students living in the dorm, and leveraging mobile phone data which reports on students’ locations and daily health symptoms.

**Bio:** Katherine Heller received a B.S. in Computer Science and Applied Math and Statistics from the State University of New York at Stony Brook, followed by an M.S. in Computer Science from Columbia University. In 2008 she received her Ph.D. from the Gatsby Computational Neuroscience Unit at University College London in the UK, and went on to do postdoctoral research in the Engineering Department at the University of Cambridge, and the Brain and Cognitive Science department at MIT. In 2012 she joined the Department of Statistical Science and Center for Cognitive Neuroscience at Duke University. She is the recipient of an NSF graduate research fellowship, an EPSRC postdoctoral fellowship, and an NSF postdoctoral fellowship. Her current research interests include Bayesian statistics, machine learning, and computational cognitive science.

(an ML talk in the CLSP Speaker Series)

Mon 11/11/13, 12:15pm, Room W4030 School of Public Health**Bayesian Learning from Big Data***David Dunson, Duke University*

**Abstract:** It has become common in many application areas to collect very big data sets, leading to challenges in statistical analyses. The overwhelming majority of the relevant literature focuses on frequentist and algorithmic approaches for point estimation, with no consideration of inferences under uncertainty. In biomedical applications and many other settings, such point estimates are essentially useless without accompanying uncertainty intervals. This talk will focus on scaling up Bayesian methods, with a particular emphasis on applications to nonparametric regression with a million predictors and estimating dependence networks based on huge multiway data. Applications in neurosciences, web advertising, epidemiology and social sciences provide motivating context.

(an ML talk in the Biostatistics Speaker Series)

Tue 11/05/13, 12:00pm, Hackerman B17**Submodularity and Big Data***Jeff Bilmes, University of Washington*

**Abstract:** The amount of data available today is a problem not only for humans but also for computer consumers of information. At the same time, bigger is different, and discovering how is an important challenge in big data sciences. In this talk, we will discuss how submodular functions can address these problems. After giving a brief background on submodularity, we will first discuss document summarization, and how one can achieve optimal results on a number of standard benchmarks using very efficient algorithms. Next we will discuss data subset selection for speech recognition systems, and how choosing a good subset has many advantages, showing results on both the TIMIT and the Fisher corpora. We will also discuss data selection for machine translation systems. Lastly, we will discuss similar problems in computer vision. The talk will include sufficient background to make it accessible to everyone.

**Bio:** Jeff Bilmes is a professor at the Department of Electrical Engineering at the University of Washington, Seattle, Washington, and also an adjunct professor in Computer Science & Engineering and the Department of Linguistics. He received his Ph.D. from the Computer Science Division of the Department of Electrical Engineering and Computer Science, University of California, Berkeley. He was a 2001 NSF Career award winner, a 2002 CRA Digital Government Fellow, a 2008 NAE Gilbreth Lectureship award recipient, and a 2012/2013 ISCA Lecturer.

(an ML talk in the CLSP Speaker Series)

Tue 10/22/13, 01:00pm, Clark Hall Room 314**Domain Adaptive Dictionaries for Object Recognition***Dr. Vishal Patel, U of MD Institute for Advanced Computer Studies*

**Abstract:** Data-driven dictionaries have produced state-of-the-art results in various image processing and computer vision problems. However, when designing dictionaries, training and testing domains may be different, due to different viewpoints and illumination conditions. In this talk, I will present a general framework for optimally representing both source and target data by a common dictionary. Specifically, I will describe a technique which can jointly learn projections of data in the two domains and a latent dictionary which can succinctly represent both the domains in the projected low-dimensional space. Applications in object and biometrics recognition will be presented.

**Bio:** Vishal Patel is a member of the research faculty at the University of Maryland Institute for Advanced Computer Studies (UMIACS). He received the B.S. degrees in Electrical Engineering and Applied Mathematics (with honors) and the M.S. degree in Applied Mathematics from North Carolina State University, Raleigh, NC, in 2004 and 2005, respectively. He received his Ph.D. from the University of Maryland, College Park, MD, in Electrical Engineering in 2010. He received an ORAU postdoctoral fellowship in 2010. His research interests are in signal processing, computer vision and pattern recognition with applications to biometrics and imaging. He is a member of Eta Kappa Nu, Pi Mu Epsilon and Phi Beta Kappa.

(an ML talk in the CIS Speaker Series)

Tue 10/22/13, 12:00pm, Hackerman B17**Recent Progress in Acoustic Speaker and Language Recognition***Alan McCree, JHU HLTCOE*

**Abstract:** In this talk, I give an overview of recent progress in the fields of speaker and language recognition, with emphasis on our current work at the JHU HLTCOE. After a brief review of modern GMM subspace methods, in particular i-vectors, I will present approaches for pattern classification using these features, with an emphasis on simple Gaussian probabilistic models. For language recognition, these are quite effective, but our recent work has shown that discriminative training can improve performance. As a bonus, this also provides meaningful probability outputs without requiring a separate calibration process. For speaker recognition, on the other hand, classification is more difficult due to the limited enrollment data per speaker, and Bayesian methods have been successful. I will discuss a number of such methods, including the popular PLDA approach. Finally, I’ll describe our recent successes in adapting these Gaussian parameters to new domains when labeled training data is not available.

**Bio:** Alan McCree is a Principal Research Scientist at the JHU HLTCOE, where his primary interest is in the theory and application of speaker and language recognition. His research in speech and signal processing at the COE, and previously at MIT Lincoln Laboratory, Texas Instruments, AT&T Bell Laboratories, and Linkabit, has found applications in international speech coding standards, digital answering machines, talking toys, and cellular telephones. He has an extensive publication and patent portfolio, and was named an IEEE Fellow in 2005. He received his PhD from Georgia Tech in 1992 after undergraduate and graduate degrees from Rice University.

(an ML talk in the CLSP Speaker Series)

Thu 10/17/13, 12:00am, Clark Hall – 3rd Floor Conference Room**Detailed Mapping of Perceptual, Linguistic and Cognitive Information Across the Human Brain***Jack Gallant, UC Berkeley*

**Abstract:** The human neocortex is a complex structure consisting of hundreds of distinct functional maps, arranged in a densely interconnected network. Each map represents different aspects of sensory, motor or cognitive information. Thus, a full understanding of human brain function depends on recovering the arrangement and underlying dimensionality of these constituent maps. My laboratory has developed a computational modeling approach that recovers sensory and cognitive maps with unprecedented detail. (1) We first use functional magnetic resonance imaging to record brain activity while subjects perform naturalistic tasks (e.g., watching movies, listening to stories, searching for objects). We then use a voxel-wise modeling approach to estimate what specific information is represented in each individual cortical voxel. Analysis of the fit models within and across subjects reveals highly complex maps of perceptual, linguistic and cognitive information. (2) To identify organizational principles of these maps across subjects, we use a generative model that does not require normalization of individual brains into a common cortical space. The resulting models reveal dozens of distinct functional areas and gradients that are found consistently in all subjects. (3) Finally, the estimated encoding models are used to decode brain activity, in order to reconstruct dynamic perceptual and cognitive experiences solely from brain activity. In sum, our voxel-wise modeling approach provides a powerful new method for mapping the representation of many different perceptual and cognitive processes across the human brain, and for decoding brain activity.

Mon 10/07/13, 12:00pm, Room W4030, School of Public Health**Characterizing Genomic Variation from Both Known and Unknown Sources***John Storey, The Lewis-Sigler Institute for Integrative Genomics & Princeton University*

**Abstract:** One of the overarching goals of my lab is to develop and apply methods to characterize complex sources of variation in genomic data so that signals of interest may be accurately identified. We have developed methods that model high-dimensional data involving both known and unknown sources of systematic variation in settings applicable to data coming from microarrays, next-generation sequencing, and genome-wide association studies. We have also carried out experiments to directly dissect and better understand sources of variation in genomic data in terms of biological and technological factors. I will present several of our recent projects that seek to tackle these important challenges.

**Note:** We request that lunch be eaten before or after the seminar, not during it.

(an ML talk in the Biostatistics Speaker Series)

Tue 10/01/13, 12:00pm, Hackerman B17**Modeling “Bootstrapping” in Language Acquisition: A Probabilistic Approach***Sharon Goldwater, University of Edinburgh*

**Abstract:** The term “bootstrapping” appears frequently in the literature on child language acquisition, but is often defined vaguely (if at all) and can mean different things to different people. In this talk, I define bootstrapping as the use of structured correspondences between different levels of linguistic structure as a way to aid learning, and discuss how probabilistic models can be used to investigate the nature of these correspondences and how they might help the child learner. I will discuss two specific examples, showing 1) that using correspondences between acoustic and syntactic information can help with syntactic learning (“prosodic bootstrapping”) and 2) that using correspondences between syntactic and semantic information in a joint learning model can help with learning both syntax and semantics while also simulating important findings from the child language acquisition literature.

**Bio:** Sharon Goldwater is a Reader (≈ US Associate Professor) in the Institute for Language, Cognition and Computation at the University of Edinburgh’s School of Informatics, and is currently a Visiting Associate Professor in the Department of Cognitive Science at Johns Hopkins University. She worked as a researcher in the Artificial Intelligence Laboratory at SRI International from 1998-2000 before starting her Ph.D. at Brown University, supervised by Mark Johnson. She completed her Ph.D. in 2006 and spent two years as a postdoctoral researcher at Stanford University before moving to Edinburgh. Her current research focuses on unsupervised learning for automatic natural language processing and computer modeling of language acquisition in children. She is particularly interested in Bayesian approaches to the induction of linguistic structure, ranging from phonemic categories to morphology and syntax.

(an ML talk in the CLSP Speaker Series)

Tue 09/24/13, 10:45am, Hackerman B17**Efficiently Learning to Behave Efficiently***Michael Littman, Brown University*

**Abstract:** The field of reinforcement learning is concerned with the problem of learning efficient behavior from experience. In real life applications, gathering this experience is time-consuming and possibly costly, so it is critical to derive algorithms that can learn effective behavior with bounds on the experience necessary to do so. This talk presents our successful efforts to create such algorithms via a framework we call KWIK (Knows What It Knows) learning. I’ll summarize the framework, our algorithms, their formal validations, and their empirical evaluations in robotic and videogame testbeds. This approach holds promise for attacking challenging problems in a number of application domains.

**Bio:** Michael L. Littman joined Brown University’s Computer Science Department as a full professor after ten years (including three as department chair) at Rutgers University. His research in machine learning examines algorithms for decision making under uncertainty. Littman has earned multiple awards for his teaching and research. He has served on the editorial boards of the Journal of Machine Learning Research and the Journal of Artificial Intelligence Research. In 2013, he was general chair of the International Conference on Machine Learning (ICML) and program co-chair of the Association for the Advancement of Artificial Intelligence Conference, and he served as program co-chair of ICML 2009.

(an ML talk in the CS Speaker Series)

Mon 09/09/13, 12:15pm, Room W4030, School of Public Health**Algebraic, Sparse and Low-Rank Subspace Clustering***Rene Vidal, BME/Center for Imaging Science*

**Abstract:** In the era of data deluge, the development of methods for discovering structure in high-dimensional data is becoming increasingly important. Traditional approaches often assume that the data is sampled from a single low-dimensional manifold. However, in many applications in signal/image processing, machine learning and computer vision, data in multiple classes lie in multiple low-dimensional subspaces of a high-dimensional ambient space. In this talk, I will present methods from algebraic geometry, sparse representation theory and rank minimization for clustering and classification of data in multiple low-dimensional subspaces. I will show how these methods can be extended to handle noise, outliers as well as missing data. I will also present applications of these methods to video segmentation and face clustering.

(an ML talk in the Biostatistics Speaker Series)

Tue 08/27/13, 12:00am, Clark 314**Interactive Object Detection***Angela Yao, ETH Zurich*

**Abstract:** In recent years, the rise of digital image and video data available has led to an increasing demand for image annotation. In this paper, we propose an interactive object annotation method that incrementally trains an object detector while the user provides annotations. In the design of the system, we have focused on minimizing human annotation time rather than pure algorithm learning performance. To this end, we optimize the detector using a realistic annotation cost model derived from a user study. Since our system gives live feedback to the user by detecting objects on the fly and predicts the potential annotation costs of unseen images, data can be efficiently annotated by a single user without excessive waiting time. In contrast to popular tracking-based methods for video annotation, our method is suitable for both still images and video. We have evaluated our interactive annotation approach on three datasets, ranging from surveillance and television to cell microscopy.

**Bio:** Angela Yao is a post-doctoral researcher at the Computer Vision Laboratory at ETH Zurich. She completed her PhD in the same lab in 2012, where she works on topics related to human motion analysis, such as tracking, pose estimation and action recognition.

(an ML talk in the CIS Speaker Series)

Wed 05/08/13, 12:00pm, Hackerman B17**Probabilistic Modeling for Large-scale Data Exploration***Chong Wang, CMU*

**Abstract:** We live in the era of “Big Data,” where we are surrounded by a dauntingly vast amount of information. How can we help people quickly navigate the data and acquire useful knowledge from it? Probabilistic models provide a general framework for analyzing, predicting and understanding the underlying patterns in the large-scale and complex data. Using a new recommender system as an example, I will show how we can develop principled approaches to advance two important directions in probabilistic modeling—exploratory analysis and scalable inference. First, I will describe a new model for document recommendation. This model not only gives better recommendation performance, but also provides new exploratory tools that help users navigate the data. For example, a user can adjust her preferences and the system can adaptively change the recommendations. Second, building a recommender system like this requires learning the probabilistic model from large-scale empirical data. I will describe a scalable approach for learning a wide class of probabilistic models, a class that includes our recommendation model, from massive data.

**Bio:** Chong Wang is a project scientist in the Machine Learning Department, Carnegie Mellon University. He received his PhD from Princeton University in 2012, advised by David Blei. His research lies in probabilistic graphical models and their applications to real-world problems. He has won several awards, including a best student paper award at KDD 2011, a notable paper award at AISTATS 2011 and a best student paper award honorable mention at NIPS 2009. He received the Google PhD Fellowship for machine learning and the Siebel Scholar Fellowship. His thesis was nominated for the ACM Doctoral Dissertation Award by Princeton University in 2012.

(an ML talk in the CS Speaker Series)

Mon 05/06/13, 01:30pm, Clark 110**Bayesian inference with efficient neural population codes***Alan A. Stocker, University of Pennsylvania*

**Abstract:** Bayesian inference has been a successful and principled model framework for explaining perceptual behavior. However, it remains unclear how the brain represents probability distributions and performs the probabilistic computations necessary to perform such inference. This talk will present some recent ideas discussing how an efficient neural representation of sensory information can make Bayesian inference ‘easy’ in the sense that a simple read-out mechanism similar to the population vector can closely approximate an optimal Bayesian decoder. Also, I will discuss the impact of these efficient codes on perceptual inference. In particular, I will show that they lead to surprising and counter-intuitive predictions that nonetheless match well some puzzling perceptual phenomena previously considered incompatible with the Bayesian view. This work links the two established theories of optimality in perceptual neuroscience, efficient encoding and Bayesian decoding.

Thu 05/02/13, 12:00pm, Maryland 109**Image Denoising via Non-convex Optimization***Yuantao Gu, Tsinghua University, MIT*

**Abstract:** Enhancing noisy images can be modeled as an optimization problem and is closely related to the problem of sparse recovery. In this talk I will focus on a non-convex approach to sparse signal recovery, and reveal the mathematical mechanism beneath this approach. People usually feel hesitant about non-convex penalties, partly because of the multiple local minima of the optimization. However, I will introduce a class of non-convex functions and provide a performance guarantee of convergence to the global optimum when the iteration is initialized at the least-squares solution and proceeds by gradient descent methods.

**Bio:** Dr. Yuantao Gu received the B.S. degree in Electronic Engineering from Xi’an Jiaotong University, Xi’an, China in 1998, and the Ph.D. degree in Electronic Engineering with honor from Tsinghua University, Beijing, China in 2003. He joined the faculty of Tsinghua University in 2003 and is currently an Associate Professor in the Department of Electronic Engineering. From 8/2012 to 5/2013, he is a Visiting Scientist at MIT.

(an ML talk in the ECE Speaker Series)

Wed 05/01/13, 04:00pm, Room W2030 School of Public Health**Regularized Matrix Decomposition and its Applications***Jianhua Huang, Texas A&M University*

**Abstract:** In this talk, I will review some recent works on regularized matrix decomposition. Depending on the application, the matrix in consideration can be the data matrix, the latent canonical parameter matrix of an exponential family distribution, or the regression coefficient matrix of a multivariate regression. I will discuss the use of various penalty functions for regularization purposes, including sparsity-inducing penalties, roughness penalties, and their combinations. Governed by the structure of the problem, the penalty can be designed for one-way or two-way regularization. I will illustrate the key ideas using applications in functional principal components analysis, biclustering, reconstruction of MEG/EEG source signals, and protein structure clustering using protein backbone angular distributions. This talk is based on joint works with Andreas Buja, Xin Gao, Seokho Lee, Mehdi Maadooliat, Haipeng Shen, Siva Tian, and Lan Zhou. Key words: Biclustering, Functional data, MEG/EEG, Principal components analysis, roughness penalty, sparsity.

(an ML talk in the Biostatistics Speaker Series)

Tue 04/30/13, 01:00pm, Clark 314**Towards Open-Universe Image Parsing with Broad Coverage***Svetlana Lazebnik, University of Illinois, Urbana*

**Abstract:** I will present our work on image parsing, or labeling each pixel in an image with its semantic category (e.g., sky, ground, tree, person, etc.). Our aim is to achieve broad coverage across hundreds of object categories in large-scale datasets that can continuously evolve. I will first describe our baseline nonparametric region-based parsing system that can easily scale to datasets with tens of thousands of images and hundreds of labels. Next, I will describe our approach to combining this region-based system with per-exemplar sliding window detectors to improve parsing performance on small object classes, which achieves state-of-the-art results on several challenging datasets. Joint work with J. Tighe.

**Bio:** Svetlana Lazebnik received her Ph.D. at the University of Illinois at Urbana-Champaign in 2006. From 2007 to 2011, she was an assistant professor of computer science at the University of North Carolina in Chapel Hill, and in 2012 she returned to the University of Illinois as a faculty member. She is the recipient of an NSF CAREER Award, a Microsoft Research Faculty Fellowship, and a Sloan Foundation Fellowship. She is a member of the DARPA Computer Science Study Group and of the editorial board of the International Journal of Computer Vision. Her research interests focus on scene understanding and modeling the content of large-scale photo collections.

**Note:** Light lunch served at 12:30pm

(an ML talk in the CIS Speaker Series)

Fri 04/19/13, 12:00pm, Hackerman B17**The Latest in DNN Research at IBM: DNN-based features, Low-Rank Matrices for Hybrid DNNs, and Convolutional Neural Networks***Tara Sainath, IBM Research*

**Abstract:** Deep Neural Networks have become the state-of-the-art for acoustic modeling, showing gains between 10-30% relative compared to Gaussian Mixture Model/Hidden Markov Models. In this talk, I discuss how to improve the performance of these networks further. First, I present work on using these networks to extract NN-based features. I show that NN-based features offer between a 10-15% relative improvement on various LVCSR tasks compared to cross-entropy trained hybrid DNNs. Furthermore, NN-based features match the performance of sequence-trained hybrid DNNs while being 2x faster to train. I will also show that if a hybrid DNN is preferred, low-rank matrix factorization can also allow for a 50% reduction in parameters and a 2x speedup in training time. Second, I present work on Convolutional Neural Networks (CNNs), an alternative type of neural network that can be used to reduce spectral variations and model spectral correlations which exist in signals. Since speech signals exhibit both of these properties, CNNs are a more effective model for speech compared to DNNs. On a variety of LVCSR tasks, we find that CNN-based features offer an additional 4-12% improvement over DNN-based features.

**Bio:** Tara Sainath received her B.S. (2004), M.Eng. (2005) and PhD (2009) in Electrical Engineering and Computer Science, all from MIT. The main focus of her PhD work was in acoustic modeling for noise robust speech recognition. She joined the Speech and Language Algorithms group at IBM T.J. Watson Research Center upon completion of her PhD. She organized a Special Session on Sparse Representations at Interspeech 2010, as well as a workshop on Deep Learning at ICML 2013. In addition, she has served as a staff reporter for the IEEE Speech and Language Processing Technical Committee (SLTC) Newsletter. She currently holds over 30 US patents. Her research interests are in acoustic modeling, including deep belief networks and sparse representations.

(an ML talk in the CLSP Speaker Series)

Wed 04/17/13, 01:00pm, Clark 314**Component Analysis for Human Sensing***Fernando De la Torre, CMU*

**Abstract:** Enabling computers to understand human behavior has the potential to revolutionize many areas that benefit society such as clinical diagnosis, human computer interaction, and social robotics. A critical element in the design of any behavioral sensing system is to find a good representation of the data for encoding, segmenting, classifying and predicting subtle human behavior. In this talk I will propose several extensions of Component Analysis (CA) techniques (e.g., kernel principal component analysis, support vector machines, spectral clustering) that are able to learn spatio-temporal representations or components useful in many human sensing tasks. In particular, I will show how several extensions of CA methods outperform state-of-the-art algorithms in problems such as facial feature detection and tracking, temporal clustering of human behavior, early detection of activities, non-rigid feature matching, weakly-supervised visual labeling, and robust classification. The talk will be adaptive, and I will discuss the topics of major interest to the audience.

**Bio:** Fernando De la Torre received his B.Sc. degree in Telecommunications (1994), M.Sc. (1996), and Ph.D. (2002) degrees in Electronic Engineering from La Salle School of Engineering in Ramon Llull University, Barcelona, Spain. In 2003 he joined the Robotics Institute at Carnegie Mellon University, and since 2010 he has been a Research Associate Professor. Dr. De la Torre’s research interests include computer vision and machine learning, in particular face analysis, optimization and component analysis methods, and their applications to human sensing. He is Associate Editor at IEEE PAMI and leads the Component Analysis Laboratory (http://ca.cs.cmu.edu) and the Human Sensing Laboratory.

(an ML talk in the CIS Speaker Series)

Tue 04/16/13, 10:45am, Hackerman B-17**Who is similar to my patient: Large-scale Patient Similarity Learning for Healthcare Analytics***Jimeng Sun, IBM TJ Watson Research Center*

**Abstract:** Large volumes of heterogeneous Electronic Health Record (EHR) data are becoming available at many healthcare institutions. Such EHR data from millions of patients serve as a huge collective memory of doctors and patients over time. How can we leverage EHR data to help caregivers and patients make better decisions? How can we use these data efficiently to help clinical and pharmaceutical research? My research focuses on developing large-scale algorithms and systems for healthcare analytics. First, I will describe our healthcare analytic research framework, which provides an intuitive collaboration mechanism across interdisciplinary teams and an efficient computation framework for handling heterogeneous patient data. Second, I will present a core component of this framework, patient similarity learning, which addresses the following questions: (1) How can physician feedback be incorporated into the similarity computation? (2) How can multiple patient similarity measures be integrated into a single consistent similarity measure? (3) How can the similarity results be presented, and user feedback obtained, in an intuitive and interactive way? I will illustrate the effectiveness of our proposed algorithms for patient similarity learning in several different healthcare scenarios. I will demonstrate an interactive visual analytic system that allows users to cluster patients and to refine the underlying patient similarity metric. Finally, I will highlight future work that I am pursuing.

**Bio:** Jimeng Sun is a research staff member in the Healthcare Analytic Department of IBM TJ Watson Research Center. He leads research projects in medical informatics, especially in developing large-scale predictive and similarity analytics for healthcare applications. Sun has an extensive research track record in data mining, specializing in healthcare analytics, big data analytics, similarity metric learning, social network analysis, predictive modeling and visual analytics. He has published over 70 papers and filed over 20 patents (4 granted). He received the ICDM best research paper award in 2007, the SDM best research paper award in 2007, and the KDD Dissertation runner-up award in 2008. Sun received his B.S. and M.Phil. in Computer Science from Hong Kong University of Science and Technology in 2002 and 2003, and his PhD in Computer Science from Carnegie Mellon University in 2007, specializing in data mining on streams, graphs and tensor data.

(an ML talk in the CS Speaker Series)

Thu 04/11/13, 01:00pm, Whitehead 304**Detecting Time-dependent Subpopulations in Network Data***Lucy Robinson, Drexel University*

**Abstract:** We introduce a new class of latent process models for dynamic relational network data with the goal of detecting time-dependent structure. Network data are often observed over time, and static network models for such data may fail to capture relevant dynamic features. We present a new technique for identifying the emergence or disappearance of distinct subpopulations of vertices. In this formulation, a network is observed over time, with attributed edges appearing at random times. At unknown time points, subgroups of vertices may exhibit a change in behavior. Such changes may take the form of a change in the overall probability of connection within or between subgroups, or a change in the distribution of edge attributes. A mixture distribution for latent vertex positions is used to detect heterogeneities in connectivity behavior over time and over vertices. The probability of edges with various attributes at a given time is modeled using a latent-space stochastic process associated with each vertex. A random dot product model is used to describe the dependency structure of the graph. As an application we analyze the Enron email corpus.

(an ML talk in the AMS Speaker Series)

Wed 04/10/13, 04:00pm, Room W2030 School of Public Health**Little Data: How Traditional Statistical Ideas Remain Relevant in a Big-Data World***Andrew Gelman, Columbia University*

**Abstract:** At the end of the day, after all the processing, big data are being used to answer little-data questions such as “Does an observed pattern generalize to the larger population?” or “Could it be explained by alternative processes (sometimes called ‘chance’)?” We discuss some recent ideas in the world of “little data” that remain of big importance.

(an ML talk in the Biostatistics Speaker Series)

Tue 04/09/13, 12:00pm, Hackerman B17**Learning with Marginalized Corrupted Features***Kilian Weinberger, Washington University in St. Louis*

**Abstract:** If infinite amounts of labeled data are provided, many machine learning algorithms become perfect. With finite amounts of data, regularization or priors have to be used to introduce bias into a classifier. We propose a third option: learning with marginalized corrupted features. We (implicitly) corrupt existing data as a means to generate additional, infinitely many, training samples from a slightly different data distribution — this is computationally tractable, because the corruption can be marginalized out in closed form. Our framework leads to machine learning algorithms that are fast, generalize well and naturally scale to very large data sets. We showcase this technology as regularization for general risk minimization and for marginalized deep learning for document representations. We provide experimental results on part of speech tagging as well as document and image classification.

**Bio:** Kilian Q. Weinberger is an Assistant Professor in the Department of Computer Science & Engineering at Washington University in St. Louis. He received his Ph.D. from the University of Pennsylvania in Machine Learning under the supervision of Lawrence Saul. Prior to this, he obtained his undergraduate degree in Mathematics and Computer Science at the University of Oxford. During his career he has won several best paper awards at ICML, CVPR and AISTATS. In 2011 he was awarded the AAAI senior program chair award and in 2012 he received the NSF CAREER award. Kilian Weinberger’s research is in Machine Learning and its applications. In particular, he focuses on high dimensional data analysis, metric learning, machine learned web-search ranking, transfer- and multi-task learning as well as biomedical applications.

(an ML talk in the CLSP Speaker Series)

Fri 04/05/13, 12:00pm, Hackerman B17**Learning from Speech Production for Improved Recognition***Karen Livescu, TTI Chicago*

**Abstract:** Speech production has motivated several lines of work in the speech recognition research community, including using articulator positions predicted from acoustics as additional observations and using discrete articulatory features as lexical units instead of or in addition to phones. Unfortunately, our understanding of speech production is still quite limited, and articulatory data is scarce. How can we take advantage of the intuitive usefulness of speech production, without relying too much on noisy information? This talk will cover recent work exploring several ideas in this area, with the theme of using machine learning to automatically infer information where our knowledge and data are lacking. The talk will include work on deriving new acoustic features using articulatory data in a multi-view learning setting, as well as lexical access and spoken term detection using hidden articulatory features.

**Bio:** Karen Livescu is an Assistant Professor at TTI-Chicago, where she has been since 2008. She completed her PhD in 2005 at MIT in the Spoken Language Systems group of the Computer Science and Artificial Intelligence Laboratory. In 2005-2007 she was a post-doctoral lecturer in the MIT EECS department. Karen’s main research interests are in speech and language processing, with a slant toward combining machine learning with knowledge from linguistics and speech science. She is a member of the IEEE Spoken Language Technical Committee and has organized or co-organized a number of recent workshops, including the ISCA SIGML workshops on Machine Learning in Speech and Language Processing and Illinois Speech Day. She is co-organizing the upcoming Midwest Speech and Language Days and the Interspeech 2013 Workshop on Speech Production in Automatic Speech Recognition.

(an ML talk in the CLSP Speaker Series)

Tue 04/02/13, 12:00pm, Clark 110**Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation***Inderjit Dhillon, University of Texas at Austin*

**Abstract:** The L1-regularized Gaussian maximum likelihood estimator has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix from very limited samples. In this talk, I will present an algorithm for solving the resulting optimization problem which is a regularized log-determinant program. In contrast to other state-of-the-art methods that largely use first order gradient information, our algorithm is based on Newton’s method and employs a quadratic approximation, but with substantial enhancements that leverage the structure of the sparse Gaussian MLE problem. A divide and conquer approach, combined with our quadratic approximation method, allows us to scale the algorithm to large-scale instances. I will present experimental results using synthetic and real application data that demonstrate the considerable improvements in performance of our method when compared to other state-of-the-art methods. This is joint work with Cho-Jui Hsieh, Matyas Sustik and Pradeep Ravikumar.

**Bio:** Inderjit Dhillon is a Professor of Computer Science at The University of Texas at Austin, and Director of the Center for Large-Scale Data Mining. Inderjit received his B.Tech. degree from the Indian Institute of Technology at Bombay, and Ph.D. from the University of California at Berkeley. At Berkeley, Inderjit studied computer science and mathematics with Beresford Parlett and Jim Demmel. His thesis work led to the fastest known numerically stable algorithm for the symmetric tridiagonal eigenvalue/eigenvector problem. Software based on this work is now part of all state-of-the-art numerical software libraries. Inderjit’s current research interests are in large-scale data mining, machine learning, network analysis, numerical optimization and scientific computing. Inderjit received an NSF Career Award in 2001, a University Research Excellence Award in 2005, the SIAG Linear Algebra Prize in 2006, the Moncrief Grand Challenge Award in 2010 and the SIAM Outstanding Paper Prize in 2011.

**Note:** Light lunch served at 11:30am

(an ML talk in the CIS Speaker Series)

Fri 03/29/13, 11:00am, Shaffer 3**Clustering Algorithms for Streaming and Online Settings***Claire Monteleoni, George Washington University*

**Abstract:** Clustering techniques are widely used to summarize large quantities of data (e.g. aggregating similar news stories); however, their outputs can be hard to evaluate. While a domain expert could judge the quality of a clustering, having a human in the loop is often impractical. Probabilistic assumptions have been used to analyze clustering algorithms, for example i.i.d. data, or even data generated by a well-separated mixture of Gaussians. Without any distributional assumptions, one can analyze clustering algorithms by formulating some objective function, and proving that a clustering algorithm either optimizes or approximates it. The k-means clustering objective, for Euclidean data, is simple, intuitive, and widely cited; however, it is NP-hard to optimize, and few algorithms approximate it, even in the batch setting (the algorithm known as “k-means” does not have an approximation guarantee). Dasgupta (2008) posed open problems for approximating it on data streams. In this talk, I will discuss my ongoing work on designing clustering algorithms for streaming and online settings. First I will present a one-pass, streaming clustering algorithm which approximates the k-means objective on finite data streams. This involves analyzing a variant of the k-means++ algorithm, and extending a divide-and-conquer streaming clustering algorithm from the k-medoid objective. Then I will turn to endless data streams, and introduce a family of algorithms for online clustering with experts. We extend algorithms for online learning with experts to the unsupervised setting, using intermediate k-means costs, instead of prediction errors, to re-weight experts. When the experts are instantiated as k-means approximate (batch) clustering algorithms run on a sliding window of the data stream, we provide novel online approximation bounds that combine regret bounds extended from supervised online learning with k-means approximation guarantees. Notably, the resulting bounds are with respect to the optimal k-means cost on the entire data stream seen so far, even though the algorithm is online. I will also present encouraging experimental results. This talk is based on joint work with Nir Ailon, Ragesh Jaiswal, and Anna Choromanska.

**Bio:** Claire Monteleoni is an assistant professor of Computer Science at The George Washington University, which she joined in 2011. Previously, she was research faculty at the Center for Computational Learning Systems, and adjunct faculty in the Department of Computer Science, at Columbia University. She was a postdoc in Computer Science and Engineering at the University of California, San Diego, and completed her PhD and Masters in Computer Science, at MIT. Her research focus is on machine learning algorithms and theory for problems including learning from data streams, learning from raw (unlabeled) data, learning from private data, and Climate Informatics: accelerating discovery in Climate Science with machine learning. Her papers have received several awards. In 2011, she co-founded the International Workshop on Climate Informatics, which is now entering its third year. She is on the Editorial Board of the Machine Learning Journal, and she served as an Area Chair for ICML 2012.

(an ML talk in the CS Speaker Series)

Tue 03/12/13, 10:45am, Hackerman B17**When Machines Learn About Humans***Moritz Hardt, IBM Almaden*

**Abstract:** The “human element” in data introduces fundamental algorithmic challenges such as protecting individual privacy, ensuring fairness in classification, and designing algorithms that are robust to population outliers and adversarial conditions. This talk focuses on the fruitful interplay between privacy and robustness. We will first give a simple and practical method for computing the principal components of a data set under the strong notion of differential privacy. Our algorithm always guarantees privacy while its utility analysis circumvents an impossibility result in differential privacy using a realistic assumption central to robust principal component analysis. We then turn to the problem of analyzing massive data sets using few linear measurements, an algorithmic paradigm known as “linear sketching”. Here we prove a “no free lunch” theorem showing that the computational efficiency of linear sketches comes at the cost of robustness. Indeed, we show that efficient linear sketches cannot guarantee correctness on adaptively chosen inputs. Our result builds on a close connection to privacy and can be seen as a novel “reconstruction attack” in the privacy setting.

**Bio:** Moritz Hardt is a post-doctoral researcher in the theory group at IBM Research Almaden. He completed his PhD in Computer Science at Princeton University in 2011, advised by Boaz Barak. His current work focuses on the algorithmic foundations of privacy, fairness and robustness in statistical data analysis. His general research areas include algorithms, machine learning and complexity theory.

(an ML talk in the CS Speaker Series)

Thu 03/07/13, 10:45am, Hackerman B17**Unraveling the genetics of disease using informed probabilistic models***Alexis Battle, Stanford University*

**Abstract:** Recent technological advances have allowed us to collect genomic data on an unprecedented scale, with the promise of revealing genetic variants, genes, and pathways disrupted in many diseases. However, identifying relevant genetic factors and ultimately unraveling the genetics of complex traits from such high dimensional data have presented significant statistical and computational challenges. In this domain, where spurious associations and lack of statistical power are major factors, I have developed machine learning methods based on robust probabilistic models that leverage biological structure, prior knowledge, and diverse sources of evidence. In particular, I have developed Bayesian methods that utilize structured information, including gene networks and cellular pathways, and transfer learning to propagate information across related genes and diseases. In this talk, I will discuss the application of such models to diverse traits including human disease and cellular traits derived from RNA-sequencing. Using this approach, I demonstrate an improvement in uncovering genetic variants affecting complex traits, along with the interactions and intermediate cellular mechanisms underlying genetic effects.

**Bio:** Alexis Battle is a PhD candidate in Computer Science at Stanford University. Her research in computational biology focuses on machine learning and probabilistic models for the genetics of complex traits. Alexis received her BS in Symbolic Systems from Stanford, and spent four years as a member of the technical staff at Google. She is the recipient of an NSF Graduate Research Fellowship and an NIH NIGMS training award.

(an ML talk in the CS Speaker Series)

Mon 03/04/13, 10:00am, COE (Stieff Building), North Conference room**Extracting Knowledge from Informal Text***Alan Ritter, University of Washington*

**Abstract:** The internet has revolutionized the way we communicate, leading to a constant flood of informal text available in electronic format, including email, Twitter, SMS and the clinical text found in electronic medical records. This presents a big opportunity for Natural Language Processing (NLP) and Information Extraction (IE) technology to enable new large scale data-analysis applications by extracting machine-processable information from unstructured text at scale. In this talk I will discuss several challenges and opportunities which arise when applying NLP and IE to informal text, focusing specifically on Twitter, which has recently risen to prominence, challenging the mainstream news media as the dominant source of real-time information on current events. I will describe several NLP tools we have adapted to handle Twitter’s noisy style, and present a system which leverages these to automatically extract a calendar of popular events occurring in the near future (http://statuscalendar.cs.washington.edu). I will further discuss fundamental challenges which arise when extracting meaning from such massive open-domain text corpora. Several probabilistic latent variable models will be presented, which are applied to infer the semantics of large numbers of words and phrases and also enable a principled and modular approach to extracting knowledge from large open-domain text corpora.

**Bio:** Alan Ritter is a Ph.D. candidate in the Department of Computer Science and Engineering at the University of Washington. His interests include NLP in short informal messages (e.g. Twitter), modeling lexical semantics with latent variables, modeling conversations in social media and paraphrasing between different styles of language (for example translating Shakespeare’s plays, noisy Twitter text or technical writing into standard English and vice versa). He was awarded an NDSEG fellowship, and won the best student paper award at IUI in 2009.

(an ML talk in the HLTCOE Speaker Series)

Wed 02/27/13, 12:00pm, Clark 110**Characterizing Abnormal Brain Networks***Archana Venkataraman, Massachusetts Institute of Technology*

**Abstract:** Connectivity analysis quantifies the relationship between brain regions. For example, anatomical connectivity informs us about neural pathways, or the internal wiring of the brain. In contrast, functional connectivity assesses neural synchrony, which relates to patterns of communication. These interactions are crucial to developing a comprehensive understanding of the brain. In this talk I will present a generative framework that combines anatomical and functional connectivity information to identify patterns associated with a neurological disorder. My framework relies on a latent structure, which captures hidden interactions within the brain. This includes the relationship between anatomy and function and the propagation of disease. The latent variables are complemented by an intuitive likelihood model for the observed neuroimaging data. The resulting algorithm produces clinically meaningful results by simultaneously localizing the centers of abnormal activity and the network of disrupted connectivity. I demonstrate that the model learns stable differences between a control and a schizophrenia population. I also tailor this framework to evaluate presurgical planning for epilepsy.

**Bio:** Archana Venkataraman is a postdoctoral associate in the Medical Vision Group at the Massachusetts Institute of Technology (MIT). Her research focuses on multimodal and clinical applications of medical imaging. Her objective is to use engineering principles, such as probabilistic modeling, signal processing and network theory, to improve the diagnosis and treatment of debilitating neurological illnesses. Archana completed her B.S., M.S. and Ph.D. in Electrical Engineering at MIT in 2006, 2007 and 2012, respectively. She is a recipient of the NIH Advanced Multimodal Neuroimaging Training Grant, the National Defense Science and Engineering Graduate Fellowship, the Siebel Scholarship and the MIT Provost Presidential Fellowship.

(an ML talk in the ECE Speaker Series)

Tue 02/26/13, 10:45am, Hackerman B17**Perturbation, Optimization and Statistics for Effective Machine Learning***Tamir Hazan, TTI Chicago*

**Abstract:** Predictions in modern statistical inference problems can be increasingly understood in terms of discrete structures such as arrangements of objects in computer vision, phonemes in speech recognition, parses in natural language processing, or molecular structures in computational biology. For example, in image scene understanding one needs to jointly predict discrete semantic labels for every pixel, e.g., whether it describes a person, bicycle, bed, etc. In a fully probabilistic treatment, all possible alternative assignments are considered, thus requiring the estimation of exponentially many structures with their respective weights. To relax the exponential complexity we describe two different approaches: dual decomposition (e.g., convex belief propagation) and predictions under random perturbations. The second approach leads us to a new approximate inference framework that is based on max-statistics which capture long-range interactions, contrasting the current framework of dual decomposition that relies on pseudo-probabilities. We demonstrate the effectiveness of our approaches on different complex tasks, outperforming the state-of-the-art results in scene understanding, depth estimation, semantic segmentation and phoneme recognition.

**Bio:** Tamir Hazan received his PhD from the Hebrew University of Jerusalem (2009) and he is currently a research assistant professor at TTI Chicago. Tamir Hazan’s research describes efficient methods for reasoning about complex models. His work on random perturbations was presented in the machine learning best papers track at AAAI 2012. Tamir Hazan’s research also includes the primal-dual norm-product belief propagation algorithm which received a best paper award at UAI 2008. Currently, these techniques outperform the state-of-the-art in different computer vision and language processing tasks.

(an ML talk in the CS Speaker Series)

Mon 02/25/13, 10:00am, COE (Stieff Building), North Conference room**Building scholarly methodologies with large-scale topic analysis***David Mimno, Princeton*

**Abstract:** In the last ten years we have seen the creation of massive digital text collections, from Twitter feeds to million-book libraries, all in dozens of languages. At the same time, researchers have developed text mining methods that go beyond simple word frequency analysis to uncover thematic patterns. When we combine big data with powerful algorithms, we enable analysts in many different fields to enhance qualitative perspectives with quantitative measurements. But these methods are only useful if we can apply them at massive scale and distinguish consistent patterns from random variations. In this talk I will describe my work building reliable topic modeling methodologies for humanists, social scientists and science policy officers.

**Bio:** David Mimno is a postdoctoral researcher in the Computer Science department at Princeton University. He received his PhD from the University of Massachusetts, Amherst. Before graduate school, he served as Head Programmer at the Perseus Project, a digital library for cultural heritage materials, at Tufts University. He is supported by a CRA Computing Innovation fellowship.

(an ML talk in the HLTCOE Speaker Series)

Fri 02/22/13, 12:00pm, Hackerman B17**Big Data Goes Mobile***Kenneth Church, IBM*

**Abstract:** What is “big”? Time & Space? Expense? Pounds? Power? Size of machine? Size of market? We will discuss many of these dimensions, but focus on throughput and latency (mobility of data). If our clouds can’t import and export data at scale, they may turn into roach motels where data can check in but can’t check out. DataScope is designed to make it easy to import and export 100s of TBs of disks. Amdahl’s Laws have stood up remarkably well to the test of time. These laws explain how to balance memory, cycles and IO. There is an opportunity to extend these laws to balance for mobility.

**Bio:** Ken is currently at IBM working on Siri-like applications of speech on phones. Before that, he was the Chief Scientist of the HLTCOE at JHU. He has worked at Microsoft and AT&T, as well. Education: MIT (undergrad and graduate). He enjoys working with large datasets. Back in the 1980s, we thought that Associated Press newswire (1 million words per week) was big, but he has since had the opportunity to work with much larger datasets such as AT&T’s billing records and Bing’s web logs. He has worked on many topics in computational linguistics including: web search, language modeling, text analysis, spelling correction, word-sense disambiguation, terminology, translation, lexicography, compression, speech (recognition and synthesis), OCR, as well as applications that go well beyond computational linguistics such as revenue assurance and virtual integration (using screen scraping and web crawling to integrate systems that traditionally don’t talk together as well as they could such as billing and customer care). Service: past president of ACL and former president of SIGDAT (the organization that organizes EMNLP).

(an ML talk in the CLSP Speaker Series)

Thu 02/21/13, 03:00pm, Barton 117**Pushing the Limits of Sparse Recovery: The Power of Correlation Awareness***Piya Pal, California Institute of Technology*

**Abstract:** Modern Sensing and Signal Processing Systems face a fundamental challenge in the extraction of meaningful information from large, complex and often distributed datasets. Such “Big Data” routinely arises in sensor networks, genomics, physiology, imaging, particle physics, social networks, and so forth. Fortunately, however, the amount of information buried in the data in most scenarios is substantially lower compared to the number of raw samples acquired. This key observation has led to the design of sensing systems that can directly capture the information using far fewer samples, typically acquired via random projections. In many natural scenarios, however, the physics of the problem itself imposes “structure” on the ensuing acquisition scheme. Also, one can often make informed realistic assumptions about the “statistical properties” of the data. Recent approaches to sparse sensing and reconstruction have only begun to investigate the advantages that such structure and statistical assumptions can offer over more traditional approaches to sparse recovery. In this talk, I will describe how “sparse structured sampling” strategies can be used to dramatically push the limits of extraction of low dimensional information buried in high dimensional data (e.g. the spatio-temporal signal received by an array of sensors), much beyond what is guaranteed by existing methods. In particular, I will develop novel sparse samplers (temporal and spatial) in one and multiple dimensions that can directly exploit the correlation and/or higher order moments of the data to greatly increase the number of identifiable parameters. I will also develop new fast, efficient, robust and scalable algorithms for sparse recovery that work on low dimensional data and guarantee recovery of sparsity levels that can be orders of magnitude larger than those achieved by existing approaches. This new paradigm of sparse support recovery, which explicitly establishes the fundamental interplay between sampling, statistics of data and the underlying sparsity, leads to exciting future research directions in a variety of application areas, and also gives rise to new questions that can lead to stand-alone theoretical results in their own right.

**Bio:** Piya Pal is a graduate student in the Department of Electrical Engineering at California Institute of Technology (Caltech), Pasadena, Calif., working in the Digital Signal Processing Lab, supervised by Professor P. P. Vaidyanathan. She received the B.Tech degree in Electronics and Electrical Communication Engineering from the Indian Institute of Technology, Kharagpur in 2007 and the M.S. degree in Electrical Engineering from Caltech in 2008. Her research interests include statistical signal processing, sparse sampling and reconstruction techniques, optimization, and sensor array processing. She received the Best Student Paper Award at the 14th IEEE DSP Workshop, 2011, held at Sedona, Ariz. She was also one of the recipients of the Student Paper Award at the 45th Asilomar Conference on Signals, Systems and Computers, 2011, held at Pacific Grove, Calif. She is one of the three winners of the Everhart Lecture Series for the year 2013, selected across all disciplines at Caltech.

(an ML talk in the ECE Speaker Series)

Thu 02/21/13, 01:30pm, Levering Hall (the Great Hall)**Improving Christofides’ Algorithm for the s-t Path Traveling Salesman Problem***David Shmoys, Cornell University*

**Abstract:** In 1976, Christofides gave an approximation algorithm for the traveling salesman problem (TSP) with metric costs that was guaranteed to find a tour that was no more than 3/2 times the length of the shortest tour visiting a given set of n cities; it remains an open problem to give a polynomial-time algorithm with a better performance guarantee. There have been a number of recent results that yield improvements for significant special cases, and for related problems. In this talk, we shall present an approximation algorithm for the s-t path TSP with metric costs, which is guaranteed to find a solution of cost within a factor of the golden ratio of optimal in polynomial time; in this variant, in addition to the pairwise distances among n points, there are two prespecified endpoints s and t, and the problem is to find a shortest Hamiltonian path between s and t. Hoogeveen showed that the natural variant of Christofides’ algorithm for this problem is a 5/3-approximation algorithm, and this asymptotically tight bound has been the best approximation ratio known until now. We modify this algorithm so that it chooses the initial spanning tree based on an optimal solution to a natural linear programming relaxation, rather than a minimum spanning tree; we prove this simple but crucial modification leads to an improved approximation ratio, surpassing the 20-year-old barrier set by the natural Christofides’ algorithm variant.

**Bio:** David Shmoys is a Professor of Operations Research and Information Engineering as well as of Computer Science at Cornell University. He obtained his Ph.D. in Computer Science from the University of California at Berkeley in 1984, and held postdoctoral positions at MSRI in Berkeley and Harvard University, and a faculty position at MIT before joining the Cornell faculty. Shmoys’s research has focused on the design and analysis of efficient algorithms for discrete optimization problems, with applications including scheduling, inventory theory, computational biology, and most recently, computational sustainability. He recently published (jointly with David Williamson) a graduate-level text on The Design of Approximation Algorithms. He is a Fellow of the ACM, a SIAM Fellow, and was an NSF Presidential Young Investigator; he has served on numerous editorial boards, and is currently on the Scientific Advisory Board of Mathematics of the Planet Earth 2013, and is Chair of the IEEE Technical Committee on Mathematical Foundations of Computing.

**Note:** The Goldman Lecture

In short, his primary focus is on the design and analysis of efficient algorithms for discrete optimization problems, in particular, on approximation algorithms for NP-hard (you know, the easy problems) and other computationally intractable problems.

Please let me know if you would like to join him for lunch or dinner on Thursday:

lunch: noon – 1pm (location to be announced)

dinner: 5:30pm – 7:30pm (location in downtown Baltimore to be announced)

If you are interested in a 30 minute time slot to meet with him, please fill out the following doodle.

http://doodle.com/axg46r8erd8znadf

(an ML talk in the AMS Speaker Series)

Tue 02/19/13, 01:00pm, Clark 314**Efficient Algorithms for Semantic Scene Parsing***Raquel Urtasun, Toyota Technological Institute at Chicago*

**Abstract:** Developing autonomous systems that are able to assist humans in everyday tasks is one of the grand challenges in modern computer science. Notable examples are personal robotics for the elderly and people with disabilities, as well as autonomous driving systems which can help decrease fatalities caused by traffic accidents. In order to perform tasks such as navigation, recognition and manipulation of objects, these systems should be able to efficiently extract 3D knowledge of their environment. While a variety of novel sensors have been developed in the past few years, in this work we focus on the extraction of this knowledge from visual information alone. In this talk, I’ll show how Markov random fields provide a great mathematical formalism to extract this knowledge. In particular, I’ll focus on a few examples, namely 3D reconstruction, 3D layout estimation, 2D holistic parsing and object detection, and show representations and inference strategies that allow us to achieve state-of-the-art performance as well as several orders of magnitude speed-ups.

**Bio:** Raquel Urtasun is an Assistant Professor at TTI-Chicago, a philanthropically endowed academic institute located on the campus of the University of Chicago. She was a visiting professor at ETH Zurich during the spring semester of 2010. Previously, she was a postdoctoral research scientist at UC Berkeley and ICSI and a postdoctoral associate at the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT. Raquel Urtasun completed her PhD at the Computer Vision Laboratory at EPFL, Switzerland, in 2006, working with Pascal Fua and David Fleet at the University of Toronto. She has been area chair of multiple learning and vision conferences (e.g., NIPS, UAI, ICML, ICCV), and has served on the committees of numerous international computer vision and machine learning conferences. Her major interests are statistical machine learning and computer vision, with a particular interest in non-parametric Bayesian statistics, latent variable models, structured prediction and their application to semantic scene understanding.

**Note:** A light lunch will be served at 12:30pm

Tue 02/19/13, 10:45am, Hackerman B17**Fast learning algorithms for discovering the hidden structure in data***Daniel Hsu, Microsoft Research*

**Abstract:** A major challenge in machine learning is to reliably and automatically discover hidden structure in data with little or no human intervention. Many of the core statistical estimation problems of this type are, in general, provably intractable for both computational and information-theoretic reasons. However, much progress has been made over the past decade or so to overcome these hardness barriers by focusing on realistic cases that rule out the intractable instances. In this talk, I’ll describe a general computational approach for correctly estimating a wide class of statistical models, including Gaussian mixture models, Hidden Markov models, Latent Dirichlet Allocation, Probabilistic Context Free Grammars, and several more. The scope of the approach extends beyond the purview of previous algorithms; and it leads to both new theoretical guarantees for unsupervised learning, as well as fast and practical algorithms for large-scale data analysis.

**Bio:** Daniel Hsu is a postdoc at Microsoft Research New England. Previously, he was a postdoc with the Department of Statistics at Rutgers University and the Department of Statistics at the University of Pennsylvania from 2010 to 2011, supervised by Tong Zhang and Sham M. Kakade. He received his Ph.D. in Computer Science in 2010 from UC San Diego, where he was advised by Sanjoy Dasgupta; and his B.S. in Computer Science and Engineering in 2004 from UC Berkeley. His research interests are in algorithmic statistics and machine learning.

(an ML talk in the CS Speaker Series)

Mon 02/18/13, 10:00am, HLTCOE (Stieff Building), North Conference room**Symbolic constraints and statistical methods: Use together for best results***Constantine Lignos, University of Pennsylvania*

**Abstract:** Language is a product of the human mind, and language data reflect the mind’s underlying structures, constraints, and limitations. Many of these restrictions appear to be symbolic and may not have obvious statistical analogues. In this talk, I will present my research into two areas where taking advantage of insights into linguistic and cognitive structure has enabled the development of computationally efficient solutions that combine the strengths of symbolic and statistical methods. (1) Language and codeswitching identification: short, often mixed-language messages such as those on Twitter pose obvious challenges for language identification systems designed for single language documents. I’ll present Codeswitchador, a system I developed as a part of SCALE 2012 which can accurately perform word-by-word language identification in short messages and identify codeswitching in large scale data sets. I’ll discuss the application of the system to construct the first large-scale corpus of Spanish/English codeswitched tweets and evaluate previous linguistic claims made regarding preferred contexts and structural constraints on codeswitching. (2) Infant word segmentation: During the first year of life, infants begin to segment words from a continuous stream of sounds. While previous computational models have proposed possible statistical solutions to word segmentation, these models make no attempt to be cognitively plausible or reflect infants’ development. I’ll review previous adult and infant word segmentation experiments and draw on that work to motivate an efficient, cognitively-oriented, online-learning word segmentation model. I’ll demonstrate that it performs well and displays characteristics of children’s changes in performance and error patterns. 
This line of research demonstrates that taking advantage of linguistic structure in conjunction with large scale data can lead to the development of high-performing, computationally efficient solutions for natural language problems.

**Bio:** Constantine Lignos is a PhD student in the University of Pennsylvania Computer and Information Science department. His research focuses on efficient approaches to unsupervised and resource-constrained language processing applications. His main areas of research include unsupervised learning of words and word structure, modeling cognitive language processes, and natural language understanding for robotics. He has built a number of language learning and understanding systems with cognitive underpinnings: MORSEL, a rule-based unsupervised morphological analyzer; CATS, an efficient and cognitively plausible infant word segmentation model; SLURP, a system for natural language understanding for human-robot interaction; and Codeswitchador, a system for language and codeswitching identification at the word level developed as a part of SCALE 2012. Before starting graduate school, he received a B.A. in Computer Science and Psychology from Yale and worked at Microsoft on speech recognition, text-to-speech, and dialog systems for automotive platforms, contributing to the Ford Sync and Kia UVO products.

(an ML talk in the HLTCOE Speaker Series)

Tue 02/12/13, 10:45am, Hackerman B17**Learning with Humans in the Loop***Yisong Yue, Carnegie Mellon University*

**Abstract:** Making sense of digital information is a growing problem in almost every domain, ranging from scientists needing to stay current with new research, to companies aiming to provide the best service for their customers. What is common to such “Big Data” problems is not only the scale of the data, but also the complexity of the human processes that continuously interact with digital environments and generate new data. Focusing on learning problems that arise in retrieval and recommender systems, I will show how we can develop principled approaches that explicitly model the process of continuously learning with humans in the loop while improving system utility. As one example, I will present the linear submodular bandits problem, which jointly addresses the challenges of selecting optimally diversified recommendations and balancing the exploration/exploitation tradeoff when personalizing via user feedback. More generally, I will show how to integrate the collection of training data with the user’s use of the system in a variety of applications, ranging from long-term optimization of personalized recommender systems to disambiguation within a single search session.

**Bio:** Yisong Yue is a postdoctoral researcher in the Machine Learning Department and the iLab at Carnegie Mellon University. His research interests lie primarily in machine learning approaches to structured prediction and interactive systems, with an application focus on problems pertaining to information systems. He received a Ph.D. from Cornell University and a B.S. from the University of Illinois at Urbana-Champaign. He is the author of the SVM-map software package for optimizing mean average precision using support vector machines. His current research focuses on machine learning approaches to diversified retrieval and interactive information retrieval.

(an ML talk in the CS Speaker Series)

Tue 02/05/13, 10:45am, Hackerman B17**Stochastic approximation algorithms for large-scale unsupervised learning***Raman Arora, TTI Chicago*

**Abstract:** The nature of signal processing and machine learning has evolved dramatically over the years as we try to investigate increasingly intricate, dynamic and large-scale systems. This development is accompanied by an explosion of massive, unlabeled, multimodal, corrupted and very high-dimensional “big data”, which poses new challenges for efficient analysis and learning. In this talk, I will advocate a learning approach based on “stochastic approximation”, wherein a single data point is processed at each iteration using a computationally simple update, to address these challenges. I will start by presenting a stochastic approximation (SA) meta-algorithm for unsupervised learning with large high-dimensional datasets. I will then describe the application of the SA algorithm to a multiview learning framework, where multiple modalities are available at the time of training but not for prediction at test time, and a similarity-based learning framework where data is observed only in the form of pairwise similarities. I will conclude with a theoretical analysis of the SA algorithm and a discussion about the pitfalls of SA approaches and the remedies thereof.
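The "one data point per iteration, computationally simple update" idea in the abstract can be illustrated with a small sketch. The abstract does not say which update the speaker's meta-algorithm uses; Oja's rule for the leading principal component is swapped in here purely as a familiar example of stochastic approximation, and the function name, step-size schedule and toy data are all assumptions made for illustration.

```python
import numpy as np

def oja_top_component(stream, dim, seed=0):
    """Estimate the leading principal direction from a stream of centered points.

    Illustrative stochastic-approximation sketch (Oja's rule), not the
    algorithm presented in the talk. Each iteration touches one data point.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=dim)
    w /= np.linalg.norm(w)
    for t, x in enumerate(stream):
        step = 1.0 / (t + 10)        # hypothetical decaying step size
        w += step * x * (x @ w)      # cheap rank-one update from a single sample
        w /= np.linalg.norm(w)       # keep the estimate on the unit sphere
    return w

# Toy data: five dimensions, with the first coordinate carrying most variance.
rng = np.random.default_rng(1)
scales = np.array([3.0, 1.0, 1.0, 1.0, 1.0])
data = rng.normal(size=(5000, 5)) * scales
w_hat = oja_top_component(data, dim=5)
```

The full covariance matrix is never formed: memory stays at one vector of length `dim`, which is the property that makes such updates attractive for large high-dimensional datasets.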

**Bio:** Raman Arora received his B.E. degree from NSIT, Delhi, India, in 2001, and M.S. and Ph.D. degrees from the University of Wisconsin-Madison in 2005 and 2009, respectively. He worked as a Research Associate at University of Washington, Seattle, from 2009 to 2011 and was a visiting researcher at Microsoft Research (MSR) during the summer of 2011. He is currently a Postdoctoral Researcher at the Toyota Technological Institute at Chicago. His research interests include online learning, large-scale machine learning, speech recognition and statistical signal processing.

(an ML talk in the CS Speaker Series)

Fri 12/14/12, 01:00pm, Department of Political Science, Yale University**Model Assisted Causal Inference: Theory and Applications***Peter M. Aronow, Department of Political Science, Yale University*

**Abstract:** I present a general framework for causal inference in randomized experiments conducted on finite populations. I propose a class of sampling-theoretic estimators of average causal effects for arbitrary experimental designs. I further demonstrate how to estimate causal parameters under interference between units or generalized forms of noncompliance. A number of results relevant to applications in field experimentation are presented.

(an ML talk in the Biostatistics Speaker Series)

Wed 12/12/12, 04:00pm, Room W2030 School of Public Health**Data Analysis: Best Practices and Future Directions***Hadley Wickham, Rice University, Statistics Department*

**Abstract:** What are best practices and what will data analysis look like in 10 years’ time? I’ll start by discussing what I think are current best practices for data analysis (combining ideas from good science and software development), then look at how things might change in the near and not-so-near future. I’ll highlight today’s projects that I think are really exciting (RStudio, D3, Amazon’s EC2), and do a little blue-sky speculation about what’s on the horizon. I’ll discuss the new field of “data science” and give some hints about what technologies you should be learning next.

**Note:** Refreshments at 3:30pm

(an ML talk in the Biostatistics Speaker Series)

Thu 12/06/12, 01:30pm, Hodson Hall 210**Attribution of Extreme Climatic Events***Richard L. Smith, University of North Carolina, Chapel Hill*

**Abstract:** Superstorm Sandy is merely the most recent high-impact weather event to raise concerns about extreme weather events becoming more frequent or more severe. Previous examples include the western European heatwave of 2003, the Russian heatwave and the Pakistan floods of 2010, and the Texas heatwave of 2011. However, it remains an open question to what extent such events may be “attributed” to human influences such as increasing greenhouse gases. One way to answer this question is to run climate models under two scenarios, one including all the anthropogenic forcing factors (in particular, greenhouse gases) while the other is run only including the natural forcings (e.g. solar fluctuations) or control runs with no forcings at all. Based on the climate model runs, probabilities of the extreme event of interest may be computed under both scenarios, followed by the risk ratio or the “fraction of attributable risk”, which has become popular in the climatology community as a measure of the human influence on extreme events. This talk will discuss statistical approaches to these quantities, including the use of extreme value theory as a method of quantifying the risk of extreme events, and Bayesian hierarchical models for combining the results of different climate models. This is joint work with Xuan Li (UNC) and Michael Wehner (Lawrence Berkeley Lab).
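For readers unfamiliar with the two quantities named in the abstract, here is a minimal sketch of how they are computed from the two event probabilities. It uses the standard definitions (risk ratio p1/p0 and fraction of attributable risk 1 - p0/p1); the function names and the probabilities in the example are hypothetical and are not results from the talk.

```python
def risk_ratio(p_all_forcings, p_natural_only):
    """Ratio of the event probability with anthropogenic forcings to that without."""
    return p_all_forcings / p_natural_only

def fraction_attributable_risk(p_all_forcings, p_natural_only):
    """FAR = 1 - p0 / p1, where p1 includes anthropogenic forcings and p0 does not."""
    return 1.0 - p_natural_only / p_all_forcings

# Hypothetical probabilities of exceeding some temperature threshold.
p1 = 0.04   # assumed probability under all forcings
p0 = 0.01   # assumed probability under natural forcings only
rr = risk_ratio(p1, p0)
far = fraction_attributable_risk(p1, p0)
```

In practice the hard part is the one the abstract points to: estimating p0 and p1 for very rare events, which is where extreme value theory and the combination of several climate models come in.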

**Bio:** Richard L. Smith is Mark L. Reed III Distinguished Professor of Statistics and Professor of Biostatistics in the University of North Carolina, Chapel Hill. He is also Director of the Statistical and Applied Mathematical Sciences Institute, a Mathematical Sciences Institute supported by the National Science Foundation. He obtained his PhD from Cornell University and previously held academic positions at Imperial College (London), the University of Surrey (Guildford, England) and Cambridge University. His main research interest is environmental statistics and associated areas of methodological research such as spatial statistics, time series analysis and extreme value theory. He is particularly interested in statistical aspects of climate change research, and in air pollution including its health effects. He is a Fellow of the American Statistical Association and the Institute of Mathematical Statistics, an Elected Member of the International Statistical Institute, and has won the Guy Medal in Silver of the Royal Statistical Society, and the Distinguished Achievement Medal of the Section on Statistics and the Environment, American Statistical Association. In 2004 he was the J. Stuart Hunter Lecturer of The International Environmetrics Society (TIES). He is also a Chartered Statistician of the Royal Statistical Society.

(an ML talk in the AMS Speaker Series)

Wed 11/28/12, 04:00pm, Room W2030 School of Public Health**Images as Predictors in Regression Models with Scalar Outcomes***R. Todd Ogden, Columbia University*

**Abstract:** One situation that arises in the field of functional data analysis is the use of imaging data or other very high dimensional data as predictors in regression models. A motivating example involves using baseline images of a patient’s brain to predict the patient’s clinical outcome. Interest lies both in making such patient-specific predictions and in understanding the relationship between the imaging data and the outcome. Obtaining meaningful fits in such problems requires some type of dimension reduction but this must be done while taking into account the particular (spatial) structure of the data. This talk will describe some of the general tools that have proven effective in this context, including principal component analysis, penalized splines, and wavelet analysis.

(an ML talk in the Biostatistics Speaker Series)

Tue 11/27/12, 12:00pm, Hackerman B17**Bridging the Gap: From Sounds to Words***Micha Elsner, Ohio State University*

**Abstract:** During early language acquisition, infants must learn both a lexicon and a model of phonetics that explains how lexical items can vary in pronunciation: for instance, “you” might be realized as ‘you’ with a full vowel or reduced to ‘yeh’ with a schwa. Previous models of acquisition have generally tackled these problems in isolation, yet behavioral evidence suggests infants acquire lexical and phonetic knowledge simultaneously. I will present ongoing research on constructing a Bayesian model which can simultaneously group together phonetic variants of the same lexical item, learn a probabilistic language model predicting the next word in an utterance from its context, and learn a model of pronunciation variability based on articulatory features. I will discuss a model which takes word boundaries as given and focuses on clustering the lexical items (published at ACL 2012). I will also give preliminary results for a model which searches for word boundaries at the same time as performing the clustering.

**Bio:** Micha Elsner is an Assistant Professor of Linguistics at the Ohio State University, where he started in August. He completed his PhD in 2011 at Brown University, working on models of local coherence. He then worked on Bayesian models of language acquisition as a postdoctoral researcher at the University of Edinburgh.

(an ML talk in the CLSP Speaker Series)

Thu 11/15/12, 01:30pm, Wolf St building Rm W4013**Through the Lens of Search Logs: Studies of the Online Pursuit of Healthcare Information***Eric Horvitz, Microsoft Research*

**Abstract:** I will present studies that explore at scale how people pursue healthcare information on the Web. The studies highlight methods for learning about information-seeking behavior from anonymized search logs, including the construction of predictive models from log data. I will focus in particular on the use of the Web for self-diagnosis, where people use search engines as a diagnostic system. Results highlight systematic problems with the widespread use of the web for diagnosis. I will discuss how biases in content and indexing can interact with cognitive biases of judgment to fuel such phenomena as heightened anxiety or “cyberchondria.” I will also present analyses that explore links between web activity and healthcare utilization by identifying transitions between diagnosis-centric search and queries for local medical assistance. Finally, I will discuss recent work on privacy-sensitive studies of search from location-enabled mobile devices, and describe inferences about transitions from searching on symptomatology to healthcare utilization. Joint work with Ryen White.

**Note:** CPHIT/CSHOR/DHSI seminar

Thu 11/15/12, 01:30pm, Whitehead 304**Preconditioning for Consistency in Sparse Inference***Karl Rohe, University of Wisconsin*

**Abstract:** Preconditioning is a technique from numerical linear algebra that can accelerate algorithms to solve systems of equations. This talk will discuss how preconditioning can also improve the statistical estimation performance in sparse high dimensional regression (aka compressed sensing). Specifically, the talk will demonstrate how preconditioning can circumvent three stringent assumptions for various types of consistency in sparse linear regression. Given X ∈ R^{n x p} and Y ∈ R^n that satisfy the standard linear regression equation Y = X beta + epsilon, this work demonstrates that even if the design matrix X does not satisfy the irrepresentable condition, the restricted eigenvalue condition, or the restricted isometry property, the design matrix FX often does, where F ∈ R^{n x n} is a specific preconditioning matrix that will be defined in the talk. By computing the Lasso on (FX, FY), instead of on (X, Y), the necessary assumptions on X become much less stringent. Crucially, left multiplying the regression equation by F does not change beta, the vector of unknown coefficients. Our preconditioner F ensures that the singular values of the design matrix are either zero or one. When n > p, the columns of FX are orthogonal and the preconditioner always circumvents the stringent assumptions. When p > n, F projects the design matrix onto the Stiefel manifold; the rows of FX are orthogonal. The Stiefel manifold is a bounded set and we show that most matrices in this set satisfy the stringent assumptions. Simulation results are particularly promising. This is joint work with Jinzhu Jia at Peking University.
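The abstract leaves the preconditioner to be defined in the talk. As a rough sketch, one construction that matches its description (all nonzero singular values of the preconditioned design become one) is F = U D^{-1} U^T, built from the thin SVD X = U D V^T. That choice, the function name and the simulated data below are assumptions made for illustration, not a statement of the speaker's exact definition.

```python
import numpy as np

def svd_preconditioner(X, tol=1e-10):
    """Return F = U diag(1/d) U^T from the thin SVD X = U diag(d) V^T.

    Assumed construction for illustration: F @ X then equals U @ V^T,
    whose nonzero singular values are all one.
    """
    U, d, _ = np.linalg.svd(X, full_matrices=False)
    keep = d > tol * d.max()          # drop numerically zero directions
    U, d = U[:, keep], d[keep]
    return (U / d) @ U.T              # scales column j of U by 1 / d[j]

# Simulated regression problem with strongly correlated columns (n > p).
rng = np.random.default_rng(0)
n, p = 50, 20
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]
Y = X @ beta + 0.1 * rng.normal(size=n)

F = svd_preconditioner(X)
FX, FY = F @ X, F @ Y                 # the pair to hand to a Lasso solver
singular_values = np.linalg.svd(FX, compute_uv=False)
gram = FX.T @ FX                      # identity when n > p and X has full rank
```

Multiplying the whole regression equation by F gives FY = FX beta + F epsilon, so the coefficient vector being estimated is unchanged; only the design and the noise are transformed. Any Lasso implementation can then be fit to `(FX, FY)` in place of `(X, Y)`.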

(an ML talk in the AMS Speaker Series)

Thu 11/15/12, 10:45am, Hackerman B17**From Data to Decisions: On Learning, Prediction, and Action in the Open World***Eric Horvitz, Microsoft Research*

**Abstract:** A confluence of advances has led to an inflection in our ability to collect, store, and harness large amounts of data for generating insights and guiding decision making in the open world. Beyond study and refinement of principles, fielding real-world systems is critical for testing the sufficiency of algorithms and implications of assumptions, and for exploring the human dimension of computational solutions and services. I will discuss several efforts pushing on the frontiers of machine learning and inference, highlighting key ideas in the context of projects in healthcare, transportation, and citizen science. Finally, I will describe directions with the composition of systems that draw upon a symphony of competencies and that operate over extended periods of time.

**Bio:** Eric Horvitz is a Distinguished Scientist at Microsoft Research. His interests span theoretical and practical challenges with developing systems that perceive, learn, and reason, with a focus on inference and decision making under uncertainty and limited resources. He has been elected a Fellow of the AAAI, the AAAS, and of the American Academy of Arts and Sciences. He received PhD and MD degrees at Stanford University. More information about his research, collaborations, and publications can be found at http://research.microsoft.com/~horvitz.

(an ML talk in the CS Speaker Series)

Tue 11/13/12, 12:00pm, Hackerman B17**From Bases to Exemplars, and From Separation to Understanding***Paris Smaragdis, University of Illinois at Urbana-Champaign*

**Abstract:** Audio source separation is an extremely useful process but most of the time not a goal by itself. Even though most research focuses on better separation quality, ultimately separation is needed so that we can perform tasks such as noisy speech recognition, music analysis, single-source editing, etc. In this talk I’ll present some recent work on audio source separation that extends the idea of basis functions to that of using ‘exemplars’ and then builds off that idea in order to provide direct computation of some of the above goals without having to resort to an intermediate separation step. In order to do so I’ll discuss some of the interesting geometric properties of mixed audio signals and how one can employ massively large decompositions with aggressive sparsity settings in order to achieve the above results.

**Bio:** Paris Smaragdis is a faculty member in the Computer Science and the Electrical and Computer Engineering departments at the University of Illinois at Urbana-Champaign. He completed his graduate and postdoctoral studies at MIT, where he conducted research on computational perception and audio processing. Prior to the University of Illinois he was a senior research scientist at Adobe Systems and a research scientist at Mitsubishi Electric Research Labs, during which time he was selected by the MIT Technology Review as one of the top 35 young innovators of 2006. Paris’ research interests lie in the intersection of machine learning and signal processing, especially as they apply to audio problems.

(an ML talk in the CLSP Speaker Series)

Wed 11/07/12, 04:00pm, Room W2030 School of Public Health**More Robust Doubly Robust Estimation***Anastasios (Butch) Tsiatis, North Carolina State University*

**Abstract:** Considerable recent interest has focused on so-called doubly robust estimators for parameters in a model for full, intended data when some data may be missing. In the simplest case, where the full data consist of an outcome Y and covariates X and interest focuses on the population mean of Y, these estimators involve models for both the propensity score, the probability that Y is observed given X, and the regression of outcome on covariates. These estimators have the appealing property that they are consistent for the true population mean even if one of the outcome regression or propensity score models, but not both, is misspecified. However, despite this appealing property, the “usual” doubly robust estimator may yield severely biased inferences if neither of these models is correctly specified and can exhibit nonnegligible bias if the estimated propensity score is close to zero for some observations, and hence has been criticized as not suitable for practical use. We review doubly robust estimation and propose alternative doubly robust estimators that achieve comparable or improved performance relative to competing methods, which we motivate in this simple setting.
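As a concrete reference point for the simple setting described in the abstract, the following sketch computes the standard augmented inverse-probability-weighted form of the "usual" doubly robust estimator of the mean of Y. It is not one of the alternative estimators proposed in the talk; the function name, the simulated data and the deliberately misspecified models are assumptions made for illustration.

```python
import numpy as np

def doubly_robust_mean(y, r, pi_hat, m_hat):
    """Usual doubly robust (AIPW) estimate of E[Y] with outcomes missing at random.

    y      : outcomes (entries where r == 0 are ignored and may be NaN)
    r      : 1 if Y is observed, 0 if missing
    pi_hat : fitted propensity scores P(R = 1 | X)
    m_hat  : fitted outcome regression E[Y | X]
    """
    y, r, pi_hat, m_hat = (np.asarray(a, dtype=float) for a in (y, r, pi_hat, m_hat))
    residual = np.where(r == 1, y - m_hat, 0.0)   # only observed cases contribute
    return float(np.mean(m_hat + residual / pi_hat))

# Simulated data (illustrative): the true mean of Y is 1.
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
y_full = 1.0 + 2.0 * x + rng.normal(size=n)
pi_true = 1.0 / (1.0 + np.exp(-(0.5 + x)))
r = (rng.uniform(size=n) < pi_true).astype(int)
y = np.where(r == 1, y_full, np.nan)

# Correct propensity model, badly wrong outcome model (constant zero).
est_wrong_outcome = doubly_robust_mean(y, r, pi_true, np.zeros(n))
# Correct outcome model, wrong propensity model (constant 0.5).
est_wrong_propensity = doubly_robust_mean(y, r, np.full(n, 0.5), 1.0 + 2.0 * x)
```

Both calls target the same mean even though one working model is wrong in each, which is the double robustness property. The division by `pi_hat` also shows where the instability mentioned in the abstract comes from: a fitted propensity near zero gives a single residual a very large weight.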

(an ML talk in the Biostatistics Speaker Series)

Thu 10/25/12, 01:30pm, Whitehead 304**Spatial Biclustering and Nonlinear Modeling of a Complex Data Set***Alan Izenman, Temple University*

**Abstract:** Using a novel database, ProDES, developed by the Crime and Justice Research Center at Temple University, this article investigates the relationship between spatial characteristics and juvenile delinquency and recidivism — the proportion of delinquents who commit crimes following completion of a court-ordered program — in Philadelphia, Pennsylvania. ProDES was originally a case-based sample, where the cases had been adjudicated in family court between 1994 and 2004. For our analysis, we focused attention on studying 6,768 juvenile males from the data set. To address the difficult issue of nonstationarity in the data, we apply the plaid biclustering algorithm in which a sequence of subsets (“layers”) of both juveniles and variables are extracted from the data one layer at a time, but where the layers are allowed to overlap with each other. This type of “biclustering” is a new way of studying juvenile offense data. We show that the juveniles within each layer can be viewed as spatially clustered. Statistical relationships of the variables and juveniles within each layer are then studied using neural network models. Results show substantial improvements in predicting juvenile recidivism using the methods of this paper.

(an ML talk in the AMS Speaker Series)

Tue 10/23/12, 12:00pm, Hackerman B17**New Waves of Innovation in Large-Scale Speech Technology Ignited by Deep Learning***Li Deng, Microsoft Research*

**Abstract:** Semantic information embedded in the speech signal manifests itself in a dynamic process rooted in the deep linguistic hierarchy as an intrinsic part of the human cognitive system. Modeling both the dynamic process and the deep structure for advancing speech technology has been an active pursuit for more than 20 years, but it is only within the past two years that a technological breakthrough has been created by a methodology commonly referred to as “deep learning”. Deep Belief Nets (DBNs) and the related deep neural nets have recently been used to supersede the Gaussian mixture model component in HMM-based speech recognition, and have produced dramatic error rate reductions in both phone recognition and large vocabulary speech recognition at industry scale while keeping the HMM component intact. On the other hand, (constrained) Dynamic Bayesian Networks have been developed for many years to improve the dynamic models of speech, aiming to overcome the IID assumption as a key weakness of the HMM, with a set of techniques commonly known as hidden dynamic/trajectory models or articulatory-like segmental representations. A history of these two largely separate lines of research will be critically reviewed and analyzed in the context of modeling the deep and dynamic linguistic hierarchy for advancing speech recognition technology. The first wave of innovation has successfully unseated the Gaussian mixture model and MFCC-like features, two of the three main pillars of the 20-year-old technology in speech recognition. Future directions for supplanting the final pillar, the HMM, will be discussed and analyzed, where frame-level scores are to be enhanced to dynamic-segment scores through new waves of innovation capitalizing on multiple lines of research that have enriched our knowledge of the deep, dynamic process of human speech.

**Bio:** Li Deng received his Ph.D. from the University of Wisconsin-Madison. He was an Assistant (1989-1992), Associate (1992-1996), and Full Professor (1996-1999) at the University of Waterloo, Ontario, Canada. He then joined Microsoft Research, Redmond, where he is currently a Principal Researcher and where he received Microsoft Research Technology Transfer, Goldstar, and Achievement Awards. Prior to MSR, he also worked or taught at Massachusetts Institute of Technology, ATR Interpreting Telecom. Research Lab. (Kyoto, Japan), and HKUST. He has published over 300 refereed papers in leading journals/conferences and 3 books covering broad areas of human language technology and machine learning. He is a Fellow of the Acoustical Society of America, a Fellow of the IEEE, and a Fellow of the International Speech Communication Association. He is an inventor or co-inventor of over 50 granted US, Japanese, or international patents. Recently, he served as Editor-in-Chief for IEEE Signal Processing Magazine (2009-2011), which ranked first in 2010 and 2011 among all 247 publications within the Electrical and Electronics Engineering Category worldwide in terms of its impact factor, and for which he received the 2011 IEEE SPS Meritorious Service Award. He currently serves as Editor-in-Chief for IEEE Transactions on Audio, Speech and Language Processing. His technical work over the past three years brought the power of deep learning into the speech recognition and signal processing fields.

(an ML talk in the CLSP Speaker Series)

Thu 10/18/12, 01:30pm, Whitehead 304**Statistical Inference on Errorfully Observed Graphs***Carey Priebe, JHU*

**Abstract:** Statistical inference on graphs is a burgeoning field in the applied and theoretical statistics communities, as well as throughout the wider world of science, engineering, business, etc. In many applications, we are faced with the reality of errorfully observed graphs. That is, the existence of an edge between two vertices is based on some imperfect assessment. In this paper, we consider a graph G = (V,E). We wish to perform an inference task; the surrogate inference task considered here is “vertex classification”. However, we do not observe G; rather, for each potential edge uv we observe an “edge-feature” which we use to classify uv as edge/not-edge. Thus we errorfully observe G when we observe the graph G’ = (V,E’). Moreover, we face a quantity/quality trade-off regarding the edge-features we observe: more informative edge-features are more expensive, and hence the number of potential edges that can be assessed decreases with the quality of the edge-features. We derive the optimal quantity/quality operating point for subsequent graph inference in the face of this trade-off. This is work in progress, joint with a cast of dozens, sponsored by JHU HLT COE & NSSEFF & DARPA.

(an ML talk in the AMS Speaker Series)

Thu 10/18/12, 10:45am, Hackerman B17**Tera-Scale Deep Learning***Quoc Le, Stanford University*

**Abstract:** Deep learning and unsupervised feature learning offer the potential to transform many domains such as vision, speech, and NLP. However, these methods have been fundamentally limited by our computational abilities, and typically applied to small-sized problems. In this talk, I describe the key ideas that enabled scaling deep learning algorithms to train a very large model on a cluster of 16,000 CPU cores (2,000 machines). This network has 1.15 billion parameters, which is more than 100x larger than the next largest network reported in the literature. Such a network, when applied at this huge scale, is able to learn abstract concepts in a much more general manner than previously demonstrated. Specifically, we find that by training on 10 million unlabeled images, the network produces features that are very selective for high-level concepts such as human faces and cats. Using these features, we also obtain significant leaps in recognition performance on several large-scale computer vision tasks.

**Bio:** Quoc Le is a PhD student at Stanford and a software engineer at Google. At Stanford and Google, Quoc works on large scale brain simulation using unsupervised feature learning and deep learning. His recent work was widely distributed and discussed on various technology blogs and news sites. Quoc obtained his undergraduate degree at Australian National University, and was a research visitor at National ICT Australia, Microsoft Research and Max Planck Institute of Biological Cybernetics. Quoc won the best paper award at ECML 2007.

(an ML talk in the CS Speaker Series)

Wed 10/10/12, 03:30pm, Room W2030 School of Public Health**GaPMM: Modeling Mixtures of Trajectory Classes in the Presence of Informative Missingness***Dr. Rebecca Nugent, Carnegie-Mellon University*

**Abstract:** Longitudinal studies are a common research design in the biomedical, biobehavioral, and social sciences, accounting more accurately for changes over time within a population. An unfortunate obstacle in working with repeated measurements is missing data which can come from subject attrition or (possibly sporadic) missing measurements at different time points in the study. This missingness is most often dependent on some unobserved variables and, if ignored, can result in biased estimation procedures. There are also several applications in which we might be interested in characterizing the types of trajectories seen in a population – for example, patients’ different patterns of recovery. In practice, modeling a population as a weighted mixture of trajectory classes can provide more detailed information about subgroup behavioral changes over time. While existing methodology can separately model trajectory (or growth) mixtures and different missingness pattern mixtures, the combination of the two is not as common. Our goal is to analyze the group structure in longitudinal data while incorporating the possibility of informative missingness with an eye toward not only improving the fit of the growth mixture model but also increasing the accuracy of “early” trajectory prediction/classification. For example, can we predict a patient’s pattern of recovery from depression after a few clinical visits even if (s)he has missed some appointments? Better trajectory classification hopefully will lead to earlier interventions for ineffective therapies or treatment arms. Applications will be shown for data sets from clinical depression studies. If time permits, a new agreement index for comparing classifications in the presence of overlapping group trajectories will be discussed.

(an ML talk in the Biostatistics Speaker Series)

Thu 10/04/12, 01:30pm, Whitehead 304**The Alternating Direction Method of Multipliers***Wotao Yin, Rice University*

**Abstract:** Through examples, we argue that the alternating direction method of multipliers (ADMM) is suitable for problems arising in image processing, conic programming, machine learning, compressive sensing, as well as distributed data processing, and the method is able to solve very large scale problem instances. The development of this method dates back to the 1950s and has close relationships with the Douglas-Rachford splitting, the augmented Lagrangian method, Bregman iterative algorithms, proximal-point algorithms, etc. This “old” method has recently become popular among researchers in image/signal processing, machine learning, and distributed/decentralized computation. After a brief overview, we explain its convergence behavior and then demonstrate how to solve very large scale conic programming problems and machine learning problems such as LASSO by ADMM.
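
To make the LASSO example mentioned in this abstract concrete, here is a minimal sketch of the textbook ADMM iteration for minimizing 0.5·||Ax − b||² + λ·||z||₁ subject to x = z. It is a generic formulation using NumPy, not necessarily the one presented in the talk; the names `admm_lasso` and `soft_threshold`, the penalty parameter `rho`, and the fixed iteration count are illustrative assumptions.

```python
import numpy as np


def soft_threshold(v, kappa):
    """Elementwise soft-thresholding, the proximal operator of kappa * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)


def admm_lasso(A, b, lam, rho=1.0, n_iter=200):
    """Approximately solve min_x 0.5 * ||A x - b||^2 + lam * ||x||_1 by ADMM."""
    n = A.shape[1]
    lhs = A.T @ A + rho * np.eye(n)  # system matrix of the x-update
    Atb = A.T @ b
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)  # scaled dual variable
    for _ in range(n_iter):
        # x-update: ridge-like least-squares step
        x = np.linalg.solve(lhs, Atb + rho * (z - u))
        # z-update: soft-thresholding enforces sparsity
        z = soft_threshold(x + u, lam / rho)
        # dual update: accumulate the constraint violation x - z
        u = u + x - z
    return z


# Toy usage: with A = I the LASSO solution is soft-thresholding of b by lam.
A = np.eye(3)
b = np.array([3.0, -0.5, 1.5])
z_hat = admm_lasso(A, b, lam=1.0)
```

For large problems one would typically factor the system matrix once and add a stopping rule based on primal and dual residuals; both are omitted here for brevity.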

(an ML talk in the AMS Speaker Series)

Tue 10/02/12, 10:45pm, Hackerman B17**Novel Probabilistic Priors for Unsupervised Discovery and Prediction from Time Series Data***Suchi Saria, JHU*

**Abstract:** Large amounts of multivariate time series data are now being collected for tracking phenomena that evolve over time. Approaches that can incorporate expert biases to discover informative representations of such data are valuable in enabling exploratory analysis and feature construction. In this talk, we will discuss priors to cluster time series that leverage two different sets of assumptions frequently present in real world data. First, we want to identify repeating “shapes”. For example, when we walk vs. kick, our joint angles produce different repeating shapes over time. How do we discover such repeating shapes in the absence of any labeled data? The second assumption we will tackle is when the generated data does not come from a fixed set of one or two classes as is the case in binary prediction. For example, in clinical data, no two patients are alike. Assuming that generated data is sampled from an m-class distribution can inappropriately bias your learned model. I will show results from multiple domains but will focus on a novel application of modeling physiologic data from monitoring infants in the ICU. Insights gained from our exploratory analysis led to a novel risk prediction score that combines patterns from continuous physiological signals to predict which infants are at risk for developing major complications. This work was published on the cover of Science Translational Medicine (Science’s new journal aimed at translational medicine work), and was covered by numerous national and international press sources.

**Bio:** Suchi Saria recently joined Johns Hopkins as an Assistant Professor in the departments of Computer Science and Health Policy & Management. She received her PhD in Computer Science from Stanford University working with Prof. Daphne Koller. She has won various awards including a Best Student Paper award, the Rambus Fellowship, the Microsoft full scholarship and the National Science Foundation Computing Innovation Fellowship. Prof. Saria’s interests are in graphical models and machine learning, with applications in modeling complex temporal systems. In particular, she wants to help solve the trillion dollar question of how to fix our health care system and show how these approaches can help us improve the delivery of healthcare.

(an ML talk in the CS Speaker Series)

Tue 10/02/12, 12:00pm, Hackerman B17**Making Computers Good Listeners***Joseph Keshet, TTI Chicago*

**Abstract:** A typical problem in speech and language processing has a very large number of training examples, is sequential, highly structured, and has a unique measure of performance, such as the word error rate in speech recognition, or the BLEU score in machine translation. The simple binary classification problem typically explored in machine learning is no longer adequate for the complex decision problems encountered in speech and language applications. Binary classifiers cannot handle the sequential nature of these problems, and are designed to minimize the zero-one loss, i.e., correct or incorrect, rather than the desired measure of performance. In addition, the current state-of-the-art models in speech and language processing are generative models that capture some temporal dependencies, such as Hidden Markov Models (HMMs). While such models have been immensely important in the development of accurate large-scale speech processing applications, and in speech recognition in particular, theoretical and experimental evidence have led to a widespread belief that such models have nearly reached a performance ceiling. In this talk, I first present a new theorem stating that a general learning update rule directly corresponds to the gradient of the desired measure of performance. I present a new algorithm for phoneme-to-speech alignment based on this update rule, which surpasses all previously reported results on a standard benchmark. I show a generalization of the theorem to training non-linear models such as HMMs, and present empirical results on a phoneme recognition task which surpass results from HMMs trained with all other training techniques. I will then present the problem of automatic voice onset time (VOT) measurement, one of the most important variables measured in phonetic research and medical speech analysis. I will present a learning algorithm for VOT measurement which outperforms previous work and performs near human inter-judge reliability.
I will discuss the algorithm’s implications for tele-monitoring of Parkinson’s disease, and for predicting the effectiveness of chemo-radiotherapy treatment of head and neck cancer.

**Bio:** Joseph Keshet received his B.Sc. and M.Sc. degrees in Electrical Engineering in 1994 and 2002, respectively, from Tel Aviv University. He received his Ph.D. in Computer Science from The School of Computer Science and Engineering at The Hebrew University of Jerusalem in 2007. From 1995 to 2002 he was a researcher at IDF, and won the prestigious Israeli award, “Israel Defense Prize”, for outstanding research and development achievements. From 2007 to 2009 he was a post-doctoral researcher at IDIAP Research Institute in Switzerland. Since 2009 he has been a research assistant professor at TTI-Chicago, a philanthropically endowed academic computer science institute on the campus of the University of Chicago. Dr. Keshet’s research interests are in speech and language processing and machine learning. His current research focuses on the design, analysis and implementation of machine learning algorithms for the domain of speech and language processing.

(an ML talk in the CLSP Speaker Series)

Thu 09/27/12, 01:30pm, Whitehead 304**Statistical Modeling of Social Network Dynamics in Relational Event Histories with Multiple Recipients***Josh Lospinoso, RedOwl Analytics*

**Abstract:** A stochastic model is proposed for inference about the evolving social network dynamics embedded in relational event histories with multiple recipients. The model builds on the seminal work of Butts (2008) and Brandes, Lerner, and Snijders (2009) by generalizing the models to the case of multiple recipients. Relational events are assumed to be a stochastic process whereby a series of sequential decisions is undertaken by the senders of the relational events. An initial recipient is chosen according to some preference from among some valid set of candidates. The sender may then decide to add another recipient according to some preference. Addition of recipients proceeds until the sender is satiated. The ordering of the recipient choices is unobserved. Inference in this latent variable problem is treated with a Markov Chain Monte Carlo maximum likelihood estimation approach based on expectation maximization. Models are fit to two exemplar datasets: the Enron email corpus and the Eisenhower Leadership Development Program Dataset collected by Dr. Kate Coronges at the US Military Academy at West Point. C. T. Butts. A relational event framework for social action. Sociological Methodology, 38(1):155–200, 2008. U. Brandes, J. Lerner, and T. A. B. Snijders. Networks evolving step by step: Statistical analysis of dyadic event data. In ASONAM’09, pages 200–205, 2009.

(an ML talk in the AMS Speaker Series)

Thu 09/20/12, 10:45am, Hackerman B17**Towards Improving Sampling and Understanding of Biomolecular Information in Molecular Dynamics Calculations***Tom Woolf, JHU School of Medicine*

**Abstract:** Molecular dynamics computer calculations with modern hardware applied to biological molecules can now reach trajectory lengths of up to microseconds. Yet, to couple these simulations to grand-challenge problems in medicine and biology (genomics, metabolomics, signaling events, and drug design), still more sampling of the conformational space available to biomolecules is needed. This creates challenges for computer algorithms and data structures to go beyond the current approach and find new ways to sample and to organize the information. This talk will describe two algorithms and recent work in data structures that may address the sampling problems. The dynamic importance sampling method grows out of Monte Carlo importance sampling to dynamically enhance sampling of intermediate conformations between stable states. The effective transfer entropy method may provide reduced dimensionality projections to further enhance sampling. These directions are coupled with suggestions for control and hierarchy in data structures to aid the more efficient and information-rich collection of molecular trajectories.

**Bio:** Tom has a physics background, with a BS in Physics from Stanford, MS from the University of Chicago, and a PhD from Yale University in Biophysics/Neuroscience. His post-doctoral work was on molecular dynamics simulations of the gramicidin channel and he’s been pursuing molecular dynamics simulations since arriving at Hopkins in 1994. He is currently a Professor at the School of Medicine in the Department of Physiology. His research has pursued studies aimed at structure:function questions for membrane proteins, at relative free energy calculations for ligand binding, and at the kinetics of conformational change. This latter research direction has led to his interest in how to most efficiently explore and understand conformational space.

(an ML talk in the CS Speaker Series)

Mon 09/17/12, 04:30pm, Clark Hall 314**Dynamic Models for Human Activity Analysis***Rizwan Chaudhry, JHU*

**Abstract:** Automatic human activity analysis from videos is a very important area of research in computer vision with numerous applications such as surveillance, security, human machine interfaces, sports training, elderly care and monitoring, building smart home and public environments as well as mining web videos for activity based content. From a research point of view, of particular interest is the development of a) rich representations for human motion across several domains, b) algorithms for tasks such as action recognition and tracking, and c) computationally efficient strategies for performing human activity analysis in very large data sets. In this talk we will address these challenges and propose features and methods for human activity analysis that are very general and can be applied across several domains. The common thread underlying all the methods is the need to explicitly model the temporal dynamics of human motion. We will first propose optical-flow based and medial-axis skeletal time series features to represent human motion in a scene. We will then model the temporal evolution of these feature time-series using dynamical systems with stochastic inputs and develop methods for comparing these dynamical systems for the purpose of human activity recognition. We will then address the issue of human activity tracking by proposing action-specific dynamic templates. The tracking problem will be posed as a joint optimization problem over the location of the person in the scene as well as the internal state of the dynamical system driving the particular activity. Finally, we will propose very fast approximate-nearest neighbor based methods on the space of dynamical systems for analyzing human motion and show that we can perform human activity recognition very efficiently albeit at a cost of slightly decreased accuracy.
Our experimental analysis will show that dynamical-systems based generative models for human activity perform very well in the above-mentioned tasks.

**Bio:** Rizwan Chaudhry received the BSc. (Honors) degree in 2005 with a double major in Computer Science and Mathematics from the Lahore University of Management Sciences in Lahore, Pakistan. Since 2006, he has been pursuing a Ph.D. in the Department of Computer Science at the Johns Hopkins University, where he has been associated with the Vision, Dynamics and Learning Lab under the guidance of Dr. Rene Vidal. His research interests are in the general areas of computer vision and machine learning, and more specifically, modeling dynamic visual phenomena, human activity recognition and tracking.

**Note:** Dissertation Defense

(an ML talk in the CS Speaker Series)

Mon 09/17/12, 12:00pm, Clark Hall 314**Distributed Optimization on Manifolds for Consensus Algorithms and Camera Network Localization***Roberto Tron, JHU*

**Abstract:** In recent years, there has been a surge of interest in networked systems, where a set of nodes or agents uses a communication network to interact and achieve a common goal. For instance, a group of vehicles might desire to coordinate their motion, or find their relative positions using cameras. There has been an effort, especially in the control systems community, to develop distributed algorithms which are particularly tailored to this setting. These algorithms, which are iterative in nature, are simple, do not require any central coordination, share the computation burden equally among all the nodes, and use minimal amounts of memory. The majority of these algorithms assume that the data on which they operate can be represented in a Euclidean space. However, for some applications, the data lies on a non-linear manifold, such as the sphere or the space of rotations. In the first contribution of this thesis, we propose an extension of an existing class of distributed coordination algorithms, called consensus algorithms, to the case where the data lies on a Riemannian manifold with bounded curvature. Our formulation relies only on the intrinsic geometry of the manifold, and we give a theoretical characterization of its convergence properties. In the second contribution of this thesis, we use the results we developed to propose a distributed, image-based algorithm for camera network localization. Using pairs of images to obtain the relative poses between the corresponding pairs of cameras, our algorithm combines all these local measurements to obtain a global localization of the network. Our proposed solution addresses all the peculiarities of the localization problem, and we also give a theoretical analysis of its convergence.
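
For readers unfamiliar with consensus algorithms, the sketch below shows the classical Euclidean averaging iteration that this abstract takes as its starting point. It is not the Riemannian extension proposed in the thesis. Each node holds a scalar and repeatedly moves toward its neighbors' values using only local communication. The function name, the step size, and the dictionary-based graph representation are illustrative assumptions.

```python
def consensus_average(values, neighbors, step=0.3, n_iter=200):
    """Run Euclidean averaging consensus on an undirected graph.

    values    -- initial scalar held by each node
    neighbors -- dict mapping node index to a list of neighbor indices
    step      -- step size; a value below 1 / (maximum degree) is a
                 sufficient choice on a connected undirected graph
    """
    x = list(values)
    for _ in range(n_iter):
        # Synchronous update: every node uses only its neighbors' current
        # values, so no central coordinator is required.
        x = [
            xi + step * sum(x[j] - xi for j in neighbors[i])
            for i, xi in enumerate(x)
        ]
    return x


# Toy usage: a path graph 0 - 1 - 2 with initial values 0, 3 and 6.
path_graph = {0: [1], 1: [0, 2], 2: [1]}
final_values = consensus_average([0.0, 3.0, 6.0], path_graph)
```

Because the graph is undirected, each update preserves the network-wide average, so every node converges to the mean of the initial values.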

**Note:** Thesis Defense

(an ML talk in the ECE Speaker Series)

Mon 09/17/12, 09:00am, Clark Hall 314**Sparse Modeling for High‐Dimensional Multi‐Manifold Data Analysis***Ehsan Elhamifar, JHU*

**Abstract:** High‐dimensional multi‐manifold data are ubiquitous in many areas of science and engineering, such as machine learning, signal and image processing, computer vision, pattern recognition, bioinformatics, etc. There are three fundamental tasks related to multi‐manifold data: clustering, dimensionality reduction, and classification. While the field of machine learning has seen great advances in these areas, the applicability of current algorithms is limited due to several challenges. First, in many problems, manifolds are spatially close or even intersect, while existing methods work only when manifolds are sufficiently separated. Second, most algorithms require knowing the dimensions or the number of manifolds a priori, while in real‐world problems such quantities are often unknown. Third, most existing algorithms have difficulties in dealing with data nuisances, such as noise, outliers, and missing entries, as well as manifolds of different intrinsic dimensions. In this thesis, we present new frameworks based on sparse representation techniques for clustering, dimensionality reduction and classification of multi‐manifold data that effectively address the aforementioned challenges. The key idea behind the proposed algorithms is what we call the self‐expressiveness property of the data. This property states that in an appropriate dictionary built using the given data points that lie in multiple manifolds, a sparse representation of a data point corresponds to selecting other points from the same manifold. Our goal is then to search for such sparse representations and use them in appropriate frameworks to cluster, embed, and classify the data. We propose sparse optimization programs to find such desired representations and develop theoretical guarantees for the success of the proposed algorithms. By extensive experiments on synthetic and real data, we demonstrate that the proposed algorithms significantly improve the state‐of‐the‐art results.

**Note:** Dissertation Defense

(an ML talk in the ECE Speaker Series)

Tue 04/24/12, 01:00pm, Clark 314**Latent Conditional Random Fields for Joint Object Categorization and Segmentation***Rene Vidal, Johns Hopkins University BME*

**Abstract:** Object categorization and segmentation are among the most challenging problems in computer vision. This is because objects often appear in cluttered scenes and exhibit immense variability in their appearance, shape and pose across natural images. While object categorization and segmentation are clearly related problems, e.g., identifying a person is easier if the person is already segmented, most of the existing literature treats these tasks separately. In this talk, I will present a unified approach where object categorization and segmentation are integrated within the same mathematical framework. In our approach, we represent objects in terms of a dictionary of visual words and propose a latent conditional random field (CRF) model in which the observed variables are category labels and the latent variables are visual word assignments. The CRF energy consists of a segmentation cost, a bag of (latent) words categorization cost, and a dictionary learning cost. Together, these costs capture relationships between image features and visual words, relationships between visual words and object categories, and spatial relationships among visual words. The segmentation, categorization, and dictionary learning parameters are learned jointly using latent structural SVMs, and the segmentation and visual words are inferred jointly using graph cuts. Experiments show that our approach gives a significant improvement in the segmentation accuracy of challenging natural scenes. Joint work with Dheeraj Singaraju and Aastha Jain.

**Bio:** Dr. Vidal received his B.S. degree in Electrical Engineering (highest honors) from the Pontificia Universidad Catolica de Chile in 1997 and his M.S. and Ph.D. degrees in Electrical Engineering and Computer Sciences from the University of California at Berkeley in 2000 and 2003, respectively. He was a research fellow at the National ICT Australia in 2003 and has been a faculty member in the Department of Biomedical Engineering and the Center for Imaging Science of The Johns Hopkins University since 2004.

(an ML talk in the CIS Speaker Series)

Wed 04/11/12, 04:00pm, School of Public Health, Room W2030**Sparsity in Multiple Kernel Learning***Dr. Ming Yuan, Georgia Tech*

**Abstract:** In this talk, we consider the problem of learning a target function that belongs to the linear span of a large number of reproducing kernel Hilbert spaces. Such a problem arises naturally in many practical situations, with ANOVA, the additive model and multiple kernel learning as the most well known and important examples. A couple of regularization techniques to exploit the sparse nature of the problem will be investigated. The optimality and adaptivity of these methods will be assessed through oracle type inequalities providing bounds on the excess risk of the resulting prediction rule.

Tue 04/10/12, 01:00pm, Clark 314**Predicting Visual Memorability***Aude Oliva, MIT*

**Abstract:** When glancing at a magazine or browsing the Internet, we are continuously exposed to images. Despite this overflow of visual information, humans are extremely good at remembering thousands of pictures along with their visual details. But not all images are created equal. Artists, advertisers and photographers are routinely challenged by the question “what makes an image memorable?”. Our recent work shows that one can predict image memorability, opening a new domain of application at the interface between human cognition and computer vision.

**Bio:** Aude Oliva is an Associate Professor in the Department of Brain and Cognitive Sciences, a Principal Investigator at the Computer Science and Artificial Intelligence Laboratory, and an Investigator of the Athinoula A. Martinos Imaging Center at MIT. After a French baccalaureate in Physics and Mathematics and a B.Sc. in Psychology, she received two M.Sc. degrees, in Experimental Psychology and in Cognitive Science, and a Ph.D. from the Institut National Polytechnique of Grenoble, France. Her research in Computational Perception and Cognition builds on the synergy between human and machine vision and how it applies to solving high level recognition problems like understanding scenes and events, perceiving space, recognizing objects, modeling attention and visual memory. In her research programs, she integrates knowledge and tools from image processing, image statistics, computer vision, perception, cognitive psychology and human cognitive neuroscience (fMRI).

(an ML talk in the CIS Speaker Series)

Tue 04/03/12, 10:45am, Hackerman B17**Machine Learning for Complex Social Processes***Hanna Wallach, University of Massachusetts Amherst*

**Abstract:** From the activities of the US Patent Office or the National Institutes of Health to communications between scientists or political legislators, complex social processes (groups of people interacting with each other in order to achieve specific and sometimes contradictory goals) underlie almost all human endeavor. In order to draw thorough, data-driven conclusions about complex social processes, researchers and decision-makers need new quantitative tools for exploring, explaining, and making predictions using massive collections of interaction data. In this talk, I will discuss the development of novel machine learning methods for modeling interaction data, focusing on the interplay between theory and practice. I will concentrate on a class of models known as statistical topic models, which automatically infer groups of semantically-related words (topics) from word co-occurrence patterns in documents. These topics can be used to detect emergent areas of innovation, identify communities, and track trends across languages. Until recently, most statistical topic models relied on two unchallenged prior beliefs. I will explain how challenging these beliefs increases the robustness of topic models to the skewed word frequency distributions common in document collections. I will also talk about a) the creation of a publicly-available search tool for National Institutes of Health (NIH) grants, intended to facilitate navigation and discovery of NIH-funded research, and b) a new statistical model of network structure and content for modeling interaction patterns in intra-governmental communication networks. Finally, I will briefly provide an overview of some of my ongoing and future research directions.

**Bio:** Hanna Wallach is an assistant professor in the Department of Computer Science at the University of Massachusetts Amherst. She is one of five core faculty members involved in UMass’s newly-formed computational social science research initiative. Previously, Hanna was a postdoctoral researcher, also at UMass, where she developed Bayesian latent variable models for analyzing complex data regarding communication and collaboration within scientific and technological communities. Her recent work (with Ryan Adams and Zoubin Ghahramani) on infinite belief networks won the best paper award at AISTATS 2010. Hanna has co-organized multiple workshops on Bayesian latent variable modeling and computational social science. Her tutorial on conditional random fields is widely referenced and used in machine learning courses around the world. As well as her research, Hanna works to promote and support women’s involvement in computing. In 2006, she co-founded the annual workshop for women in machine learning, in order to give female faculty, research scientists, postdoctoral researchers, and graduate students an opportunity to meet, exchange research ideas, and build mentoring and networking relationships. In her not-so-spare time, Hanna is a member of Pioneer Valley Roller Derby, where she is better known as Logistic Aggression.

(an ML talk in the CS Speaker Series)

Mon 04/02/12, 09:00am, Room #111 Krieger Hall**Learning and Inference as Computational Strategies for Dealing with Uncertainty***Vikranth Rao-Bejjanki, University of Rochester*

**Abstract:** In this talk I will describe several studies that examine the computational mechanisms underlying human behavior in the domains of perception and cognition. A key consideration is that all perceptual and cognitive tasks involve multiple levels of ambiguity; the goal in these tasks is therefore to infer the best estimate of the task-relevant stimulus. Indeed, from a computational perspective, we can model these tasks as comprising multiple stages of probabilistic inference, performed over data of varying levels of abstraction – from neural firing rates to concepts. In this framework, the goal of processing to a particular level is to infer the variable of interest given uncertain input information and the goal of learning is to improve the quality of that inference. I will present results at the behavioral level showing that human observers combine cues to speech information in a manner consistent with optimal inference and that this strategy allows them to reduce uncertainty in their performance. At the neural level, I will present evidence that the improved performance observed as a result of perceptual learning can be mediated by improved inference in early sensory areas. Finally, I will describe ongoing work on characterizing developmental changes in the strategies used by young children to carry out inference and learning, and future work aimed at improving inference and learning in individual subjects.

**Note:** There will be an “Open” Question and Answer Session later in the day from 5:00-6:00 pm in room #111 Krieger Hall.

(an ML talk in the CogSci Speaker Series)

Fri 03/30/12, 12:00pm, Hackerman B17**Machine Learning in the Loop***John Langford, Yahoo! Research*

**Abstract:** The traditional supervised machine learning paradigm is inadequate for a wide array of potential machine learning applications where the learning algorithm decides on an action in the real world and gets feedback about that action. This inadequacy results in kludgy systems, such as for ad targeting at internet companies, or in deep systemic mistrust and skepticism, such as for personalized medicine or adaptive clinical trials. I will discuss a new formal basis, algorithms, and practical tricks for doing machine learning in this setting.

**Bio:** John Langford is a computer scientist, working as a senior researcher at Yahoo! Research. He studied Physics and Computer Science at the California Institute of Technology, earning a double bachelor’s degree in 1997, and received his Ph.D. from Carnegie Mellon University in 2002. Previously, he was affiliated with the Toyota Technological Institute and IBM’s Watson Research Center. He is also the author of the popular Machine Learning weblog, hunch.net, and the principal developer of Vowpal Wabbit.

(an ML talk in the CS Speaker Series)

Thu 03/29/12, 01:30pm, Room #111 Krieger Hall**Optimal Inference with Limited Cognitive Resources***Dr. Edward Vul, University of California, San Diego*

**Abstract:** How do people make approximately rational inferences despite their limited cognitive resources? In many tasks, across a number of cognitive domains, average human behavior matches ideal Bayesian agents; however, exact Bayesian inference is intractable and must be approximated even in engineering applications such as machine learning and statistics. This intractability poses an even larger challenge for probabilistic models of human cognition: how can people carry out these computations despite their profound limitations in memory and computational speed? We have taken a two-pronged approach to answering this problem by investigating two interrelated questions: What approximate inference strategies do people adopt, and how do people use their limited resources to support complex inferences? I will describe our work suggesting that people *sample* to approximate inference and that they strategically allocate memory and computation time given structured models of specific tasks. I see these as the first steps toward reconciling the Bayesian computational models of human reasoning with known human limitations.

(an ML talk in the CogSci Speaker Series)

Wed 03/28/12, 04:00pm, Room W2030 of the Bloomberg School of Public Health Building**False Discovery Rate Under Arbitrary Dependence***Jianqing Fan, Princeton University*

**Abstract:** Multiple hypothesis testing is a fundamental problem in high-dimensional inference, with wide applications in many scientific fields. In genome-wide association studies, tens of thousands of tests are performed simultaneously to find if any genes are associated with some traits, and those tests are correlated. When test statistics are correlated, false discovery control becomes very challenging under arbitrary dependence. In the current paper, we propose a new methodology based on principal factor approximation, which successfully subtracts the common dependence and significantly weakens the correlation structure, to deal with an arbitrary dependence structure. We derive the theoretical distribution for false discovery proportion (FDP) in large-scale multiple testing when a common threshold is used and provide a consistent estimate of FDP. This result has important applications in controlling FDR and FDP. Our estimate of FDP compares favorably with Efron (2007)’s approach, as demonstrated in the simulated examples. Our approach is further illustrated by some real data applications.

(an ML talk in the Biostatistics Speaker Series)

Tue 03/27/12, 12:00pm, Hackerman B17**Linguistic Structure Prediction with AD3***Noah Smith, Carnegie Mellon University*

**Abstract:** In this talk, I will present AD3 (Alternating Directions Dual Decomposition), an algorithm for approximate MAP inference in loopy graphical models with discrete random variables, including structured prediction problems. AD3 is simple to implement and well-suited to problems with hard constraints expressed in first-order logic. It often finds the exact MAP solution, giving a certificate when it does; when it doesn’t, it can be embedded within an exact branch and bound technique. I’ll show experimental results on two natural language processing tasks, dependency parsing and frame-semantic parsing. This work was done in collaboration with Andre Martins, Dipanjan Das, Pedro Aguiar, Mario Figueiredo, and Eric Xing.

**Bio:** I am the Finmeccanica Associate Professor of Language Technologies and Machine Learning in the School of Computer Science at Carnegie Mellon University. I received my Ph.D. in Computer Science, as a Hertz Foundation Fellow, from Johns Hopkins University in 2006 and my B.S. in Computer Science and B.A. in Linguistics from the University of Maryland in 2001. My research interests include statistical natural language processing, especially unsupervised methods, machine learning for structured data, and applications of natural language processing. My book, Linguistic Structure Prediction, covers many of these topics. I serve on the editorial board of the journal Computational Linguistics and the Journal of Artificial Intelligence Research and received a best paper award at the ACL 2009 conference. My research group, Noah’s ARK, is supported by the NSF (including an NSF CAREER award), DARPA, Qatar NRF, IARPA, ARO, Portugal FCT, and gifts from Google, HP Labs, IBM Research, and Yahoo Research.

(an ML talk in the CLSP Speaker Series)

Mon 03/12/12, 04:00pm, MBI Library**Computational Symmetry***Yanxi Liu, Penn State University*

**Abstract:** Symmetry is an essential mathematical concept, as well as a ubiquitous observable phenomenon in nature, science and art. Either by evolution or by design, symmetry imparts an efficient coding that makes it universally appealing — recognition of symmetry and regularity is the first step towards capturing the essential structure of a real world problem while minimizing computational redundancy. Automatic symmetry detection from real world (digital) data turns out to be a surprisingly challenging problem that has puzzled researchers in machine intelligence, computer vision, robotics, and computer graphics for the past four decades. Recognizing the fundamental relevance and potential power that computational symmetry affords, we explore a formal and computational characterization of real world symmetry using a group theoretical model. Such a formalization simultaneously facilitates: (1) a robust and comprehensive algorithmic treatment of the whole regularity spectrum; (2) an effective detection scheme for real world symmetries and symmetry groups; and (3) a set of well-defined bases for measuring and discriminating quantified regularities on diverse data sets. In this talk, I will summarize the theoretical background on crystallographic groups, and will illustrate recent results of applications of computational symmetry in: texture analysis/synthesis, tracking, and manipulation; perceptual grouping and 3D modeling of urban scenes from a single view; automatic geo-tagging; and image ‘de-fencing’. I will also discuss evaluation of computer algorithm performance on real world symmetries, and future work suggesting intertwined relations with human perception.

**Note:** This seminar is a part of the David Bodian Seminar Series with the Johns Hopkins University Mind/Brain Institute.

Wed 03/07/12, 10:00am, COE (Stieff Building), North Conference room**Variational Bayesian Methods for Unsupervised Latent Factor Models of Text and Audio***Matthew D. Hoffman, Columbia University*

**Abstract:** In this talk, I will discuss variational strategies for fitting two Bayesian models that explain high-dimensional media data in terms of sets of latent factors. The first model, Latent Dirichlet Allocation (LDA), is a popular model of text corpora that learns to represent documents as mixtures of latent “topic” distributions. We develop an online variational Bayes (VB) algorithm for LDA. Online LDA is based on online stochastic optimization with a natural gradient step, which we show converges to a local optimum of the VB objective function. It can handily analyze massive document collections, including those arriving in a stream. We study the performance of online LDA in several ways, including by fitting a 100-topic topic model to 3.3M articles from Wikipedia in a single pass. We demonstrate that online LDA finds topic models as good as or better than those found with batch VB, and in a fraction of the time. The second model, Gamma Process Nonnegative Matrix Factorization (GaP-NMF), is a new Bayesian nonparametric model of audio spectrograms that addresses the problem of latent source discovery and separation in audio recordings. GaP-NMF allows us to discover what sounds (e.g. bass drums, guitar chords, etc.) are present in a recording and to isolate or suppress individual sources. Crucially, this model is able to decide how many latent sources are necessary to model the data. This feature is particularly valuable in this application, since it is impossible to guess a priori how many sounds will appear in a given recording. Although the GaP-NMF model lacks the conditional conjugacy enjoyed by models such as LDA, we are nonetheless able to efficiently fit it to data using a novel variational algorithm.

**Bio:** Matthew D. Hoffman is a postdoctoral researcher in the Department of Statistics at Columbia University. He received his Ph.D. in Computer Science from Princeton University in 2010. His research interests are in machine learning, statistical modeling, audio signal processing, content-based music information retrieval, and the associated computational issues.

**Note:** The Stieff Building is at 810 Wyman Park Drive, a half-mile’s walk south of the Homewood campus.

(an ML talk in the HLTCOE Speaker Series)

Tue 03/06/12, 12:00pm, Hackerman B17**Fast, Accurate and Robust Multilingual Syntactic Analysis***Slav Petrov, Google*

**Abstract:** To build computer systems that can ‘understand’ natural language, we need to go beyond bag-of-words models and take the grammatical structure of language into account. Part-of-speech tag sequences and dependency parse trees are one form of such structural analysis that is easy to understand and use. This talk will cover three topics. First, I will present a coarse-to-fine architecture for dependency parsing that uses linear-time vine pruning and structured prediction cascades. The resulting pruned third-order model is twice as fast as an unpruned first-order model and compares favorably to a state-of-the-art transition-based parser in terms of speed and accuracy. I will then present a simple online algorithm for training structured prediction models with extrinsic loss functions. By tuning a parser with a loss function for machine translation reordering, we can show that parsing accuracy matters for downstream application quality, producing improvements of more than 1 BLEU point on an end-to-end machine translation task. Finally, I will present approaches for projecting part-of-speech taggers and syntactic parsers across language boundaries, allowing us to build models for languages with no labeled training data. Our projected models significantly outperform state-of-the-art unsupervised models and constitute a first step towards a universal parser. This is joint work with Ryan McDonald, Keith Hall, Dipanjan Das, Alexander Rush, Michael Ringgaard and Kuzman Ganchev (a.k.a. the Natural Language Parsing Team at Google).

**Bio:** Slav Petrov is a Senior Research Scientist in Google’s New York office. He works on problems at the intersection of natural language processing and machine learning. He is in particular interested in syntactic parsing and its applications to machine translation and information extraction. He also teaches a class on Statistical Natural Language Processing at New York University every Fall. Prior to Google, Slav completed his PhD degree at UC Berkeley, where he worked with Dan Klein. He holds a Master’s degree from the Free University of Berlin, and also spent a year as an exchange student at Duke University. Slav was a member of the FU-Fighters team that won the RoboCup 2004 world championship in robotic soccer and recently won a best paper award at ACL 2011 for his work on multilingual syntactic analysis. Slav grew up in Berlin, Germany, but is originally from Sofia, Bulgaria. He therefore considers himself a Berliner from Bulgaria. Whenever Bulgaria plays Germany in soccer, he supports Bulgaria.

(an ML talk in the CLSP Speaker Series)

Tue 03/06/12, 10:45am, Hackerman B17**Scalable Bayesian Learning for Complex Data***Yuan (Alan) Qi, Purdue University*

**Abstract:** Data are being generated at an unprecedented pace in various scientific and engineering areas, including biomedical engineering, materials science, and social science. These data provide us with precious opportunities to reveal hidden relationships in natural or synthetic systems and predict their functions and properties. With growing complexity, however, the data impose new computational challenges: for example, how to handle the high dimensionality, nonlinear interactions, and massive volume of the data. To address these new challenges, I have been developing advanced Bayesian models to capture the data complexity and designing scalable algorithms to learn the models efficiently from data. In this talk, I will describe three of my recent works along this line: 1) efficient learning of novel nonparametric models on tensors to discover communities in social networks and predict who should be your friends on Facebook; 2) parallel inference for new hierarchical Bayesian models to identify rare cancer stem cells from massive single-cell data; and 3) finding repeated network modules in either multiple graphs or a single noisy graph, with applications to materials science and systems biology. I will present experimental results on real-world data, demonstrating the superior predictive performance of the proposed approaches, and discuss other applications of these approaches such as in patient drug response analysis and neuroscience.

**Bio:** Alan Qi obtained his PhD from MIT in 2005 and worked as a postdoctoral researcher at MIT from 2005 to 2007. In 2007, he joined Purdue University as an assistant professor of Computer Science and Statistics (and Biology by courtesy). He received the A. Richard Newton Breakthrough Research Award from Microsoft Research in 2008, the Interdisciplinary Award from Purdue University in 2010, and the NSF CAREER award in 2011. His research interest lies in scalable Bayesian learning and its applications. His group not only develops sparse, nonparametric, and dynamic learning algorithms, but also collaborates with domain experts (such as biomedical researchers at Purdue, materials scientists at MIT, and psychologists at the University of Toronto) on a wide range of scientific and engineering applications.

(an ML talk in the CS Speaker Series)

Fri 03/02/12, 12:00pm, Hackerman B17**Efficient Search and Learning for Language Understanding and Translation***Liang Huang, Information Sciences Institute/ University of Southern California*

**Abstract:** What is in common between translating from English into Chinese and compiling C++ into machine code? And yet what are the differences that make the former so much harder for computers? How can computers learn from human translators? This talk sketches an efficient (linear-time) “understanding + rewriting” paradigm for machine translation inspired by both human translators as well as compilers. In this paradigm, a source language sentence is first parsed into a syntactic tree, which is then recursively converted into a target language sentence via tree-to-string rewriting rules. In both “understanding” and “rewriting” stages, this paradigm closely resembles the efficiency and incrementality of both human processing and compiling. We will discuss these two stages in turn. First, for the “understanding” part, we present a linear-time approximate dynamic programming algorithm for incremental parsing that is as accurate as those much slower (cubic-time) chart parsers, while being as fast as those fast but lossy greedy parsers, thus getting the advantages of both worlds for the first time, achieving state-of-the-art speed and accuracy. But how do we efficiently learn such a parsing model with approximate inference from huge amounts of data? We propose a general framework for structured prediction based on the structured perceptron that is guaranteed to succeed with inexact search and works well in practice. Next, the “rewriting” stage translates these source-language parse trees into the target language. But parsing errors from the previous stage adversely affect translation quality. An obvious solution is to use the top-k parses, rather than the 1-best tree, but this only helps a little bit due to the limited scope of the k-best list. We instead propose a “forest-based approach”, which translates a packed forest encoding *exponentially* many parses in a polynomial space by sharing common subtrees. 
Large-scale experiments showed very significant improvements in translation quality, outperforming the leading systems in the literature. Like the “understanding” part, the translation algorithm here is also linear-time and incremental, and thus resembles human translation. We conclude by outlining a few future directions.

**Bio:** Liang Huang is a Research Assistant Professor at the University of Southern California (USC), and a Research Scientist at USC’s Information Sciences Institute (ISI). He received his PhD from the University of Pennsylvania in 2008, and worked as a Research Scientist at Google before moving to USC/ISI. His research focuses on efficient search algorithms for natural language processing, especially in parsing and machine translation, as well as related structured learning problems. His work received a Best Paper Award at ACL 2008, and three Best Paper Nominations at ACL 2007, EMNLP 2008, and ACL 2010.

(an ML talk in the CLSP Speaker Series)

Tue 02/28/12, 01:00pm, Clark 314**Machine Learning Approaches to Reconstructing Neural Wiring Diagrams***Viren Jain, Howard Hughes Medical Institute*

**Abstract:** The brain is composed of networks of connected cells called neurons. The detailed structure of these networks is likely to be a crucial determinant of nervous system behavior and function, and is therefore an attractive target of study for neurobiology research. Obtaining measurements of such structure is in principle straightforward; in one promising approach, 3D images of the brain are collected at nanometer resolution and then the images are reconstructed to find which cells are connected to each other. In practice, however, this is difficult. In particular, automating the segmentation of teravoxel-sized images of brain tissue is a major challenge for current computer vision methods. I will discuss our efforts to automate image analysis in this domain using a combination of machine learning technologies: deep convolutional networks for end-to-end image analysis, novel cost functions and learning algorithms that optimize true measures of segmentation performance, and novel structured prediction approaches for reasoning about large amounts of image context. I will also discuss efforts to combine these machine-learning technologies with crowd-sourcing approaches to manual data annotation.

**Bio:** Viren Jain is a Fellow and Laboratory Head at the Howard Hughes Medical Institute (HHMI) Janelia Farm Research Campus. He completed his undergraduate education at the University of Pennsylvania with degrees in Computer Science and Cognitive Science, followed by a PhD in Computation at the Massachusetts Institute of Technology. At MIT, he worked with Sebastian Seung and collaborators at the Max Planck Institute for Medical Research to develop new algorithms for automated image analysis, with the goal of enabling high-throughput reconstruction of brain connectivity. At Janelia, Viren continues to develop novel algorithmic tools, and applies them to studying specific biological questions that relate neuronal structure to function.

(an ML talk in the CIS Speaker Series)

Tue 02/28/12, 10:45am, Hackerman B17**Probabilistic Programming: Beyond Graphical Models***David Wingate, MIT*

**Abstract:** Over the last 10 years, probabilistic graphical models have become one of the cornerstones of modern machine learning. As science and engineering increasingly turn to them to solve difficult learning and data analysis problems, it is becoming more and more important to provide tools that make advanced statistical inference accessible to a broad community. In this talk I will discuss my work on probabilistic programming, a recent generalization of graphical models: rather than marry statistics with graph theory, probabilistic programming marries Bayesian probability with computer science. It allows modelers to specify a complex generative process using syntax that resembles a modern programming language, easily defining distributions using language features such as data structures, recursion, or native libraries. Scalable inference is the key challenge. I will discuss how we are leveraging concepts from programming language theory (such as monads, anonymous functions, and memoization), as well as compiler design (program analysis, source code transformations, nonstandard interpretations, and code factorization) to both define and implement universal inference algorithms for probabilistic programming languages. I will illustrate the results on a variety of tasks, with emphasis on inversion problems from geophysics.

**Bio:** David Wingate is a research scientist at MIT with a joint appointment in the Laboratory for Information and Decision Systems and the Computational Cognitive Science group. He obtained a B.S. and M.S. in Computer Science from Brigham Young University, and a Ph.D. in Computer Science from the University of Michigan. His research focuses on the intersection of probabilistic modeling (with an emphasis on Bayesian nonparametrics), machine learning, dynamical systems modeling, reinforcement learning and probabilistic programming.

(an ML talk in the CS Speaker Series)

Mon 02/27/12, 10:00am, Stieff Building, North Conference room**Statistical Modeling and Learning for Machine Translation***Kevin Gimpel, Carnegie Mellon University*

**Abstract:** Recent years have seen a flurry of research in translation modeling for statistical machine translation. Widely-used approaches include (1) models based on flat phrase-to-phrase mappings, and (2) models that use syntactic structure. I will present an approach that combines the strengths of these two in a single model. Experiments show that it leads to improved translation quality over state-of-the-art systems. When supervised syntactic parsers are not available, I will show that unsupervised parsers can be substituted with minimal loss in translation quality. I will also discuss the problem of tuning model parameters. An abundance of techniques has been developed by the statistical machine learning community, but the machine translation problem is fundamentally different from other structure prediction tasks in key ways. These differences have caused well-known learning algorithms to lose theoretical guarantees when applied to machine translation. I will propose a novel algorithm that does offer theoretical guarantees and shows empirical advantage over state-of-the-art approaches.

**Bio:** Kevin Gimpel is a Ph.D. student in the Language Technologies Institute at Carnegie Mellon University where he is advised by Noah Smith. His research focuses on machine translation, with supporting interests in natural language processing and machine learning. He is also interested in new tasks involving emerging data sources, including social media, movie reviews, and restaurant menus. He interned with the machine translation team at Google during summer 2009 and has been a Sandia National Laboratories Excellence in Science and Technology Fellow since 2010.

**Note:** The Stieff Building is at 810 Wyman Park Drive, a half-mile’s walk south of the Homewood campus.

(an ML talk in the HLTCOE Speaker Series)

Fri 02/24/12, 11:00am, Shaffer 3**Algorithms and Lower Bounds for Sparse Recovery***Eric Price, MIT*

**Abstract:** The goal of *stable sparse recovery* or *compressive sensing* is to recover a K-sparse approximation x* of a vector x in R^N from M linear measurements of x. This problem has a wide variety of applications, including streaming algorithms, image acquisition, and genetic testing. A common formulation is to recover x* such that ||x-x*||_2 < (1 + eps) min_{K-sparse x'} ||x-x'||_2 for some constant C with 3/4 probability over the choice of measurements. In this talk, we will give upper and lower bounds for this problem. For the *nonadaptive* case, where all the measurements must be chosen independent of x, we give a lower bound showing M = Theta((1/eps) K log (N/K)) is optimal. In the *adaptive* setting, where each measurement may be based on the results of previous measurements, we show that M = O((1/eps) K log log (N/K)) is possible, an exponential improvement in N. These results contain joint work with David Woodruff and Piotr Indyk, and appeared in FOCS 2011. They are available at http://arxiv.org/abs/1110.4414 and http://arxiv.org/abs/1110.3850 .

**Bio:** Eric Price is a third-year Ph.D. student in MIT CSAIL, interested in algorithms. He mostly works on sparse recovery/compressive sensing. His advisor is Piotr Indyk.

(an ML talk in the CS Speaker Series)

Tue 02/21/12, 12:00pm, Hackerman B17**Bayesian Nonparametric Methods for Complex Dynamical Phenomena***Emily Fox, University of Pennsylvania*

**Abstract:** Markov switching processes, such as hidden Markov models (HMMs) and switching linear dynamical systems (SLDSs), are often used to describe rich classes of dynamical phenomena. They describe complex temporal behavior via repeated returns to a set of simpler models: imagine, for example, a person alternating between walking, running and jumping behaviors, or a stock index switching between regimes of high and low volatility. Traditional modeling approaches for Markov switching processes typically assume a fixed, pre-specified number of dynamical models. Here, in contrast, I develop Bayesian nonparametric approaches that define priors on an unbounded number of potential Markov models. Using stochastic processes including the beta and Dirichlet processes, I develop methods that allow the data to define the complexity of inferred classes of models, while permitting efficient computational algorithms for inference. The new methodology also has generalizations for modeling and discovery of dynamic structure shared by multiple related time series. Interleaved throughout the talk are results from studies of the NIST speaker diarization database, stochastic volatility of a stock index, the dances of honeybees, and human motion capture videos.

**Bio:** Emily B. Fox received the S.B. degree in 2004, M.Eng. degree in 2005, and E.E. degree in 2008 from the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology (MIT). She is currently an assistant professor in the Wharton Statistics Department at the University of Pennsylvania. Her Ph.D. was advised by Prof. Alan Willsky in the Stochastic Systems Group, and she recently completed a postdoc in the Department of Statistical Science at Duke University working with Profs. Mike West and David Dunson. Emily is a recipient of the National Defense Science and Engineering Graduate (NDSEG) and National Science Foundation (NSF) Graduate Research fellowships, and of an NSF Mathematical Sciences Postdoctoral Research Fellowship. She has also been awarded the 2009 Leonard J. Savage Thesis Award in Applied Methodology, the 2009 MIT EECS Jin-Au Kong Outstanding Doctoral Thesis Prize, the 2005 Chorafas Award for superior contributions in research, and the 2005 MIT EECS David Adler Memorial 2nd Place Master’s Thesis Prize. Her research interests are in multivariate time series analysis and Bayesian nonparametric methods.

(an ML talk in the CLSP Speaker Series)

Tue 02/21/12, 10:45am, Hackerman B17**Machine Learning in the Bandit Setting: Algorithms, Evaluation, and Case Studies***Lihong Li, Yahoo! Research*

**Abstract:** Much of machine-learning research is about discovering patterns: building intelligent agents that learn to predict the future accurately from historical data. While this paradigm has been extremely successful in numerous applications, complex real-world problems such as content recommendation on the Internet often require the agents to learn to act optimally through autonomous interaction with the world they live in, a problem known as reinforcement learning. Using a news recommendation module on Yahoo!’s front page as a running example, the majority of the talk focuses on the special case of contextual bandits, which have gained substantial interest recently due to their broad applications. We will highlight a fundamental challenge known as the exploration/exploitation tradeoff, present a few newly developed algorithms with strong theoretical guarantees, and demonstrate their empirical effectiveness for personalizing content recommendation at Yahoo!. At the end of the talk, we will also summarize (briefly) our earlier work on provably data-efficient algorithms for more general reinforcement-learning problems modeled as Markov decision processes.

**Bio:** Lihong Li is a Research Scientist in the Machine Learning group at Yahoo! Research. He obtained a PhD degree in Computer Science from Rutgers University, advised by Michael Littman. Before that, he obtained an MSc degree from the University of Alberta, advised by Vadim Bulitko and Russell Greiner, and a BE from Tsinghua University. In the summers of 2006-2008, he enjoyed interning at Google, Yahoo! Research, and AT&T Shannon Labs, respectively. His main research interests are in machine learning with interaction, including reinforcement learning, multi-armed bandits, online learning, active learning, and their numerous applications on the Internet. He is the winner of an ICML’08 Best Student Paper Award, a WSDM’11 Best Paper Award, and an AISTATS’11 Notable Paper Award.

(an ML talk in the CS Speaker Series)

Mon 02/20/12, 10:00am, Stieff Building, North Conference room**Learning to Efficiently Rank***Lidan Wang, University of Maryland*

**Abstract:** Technological advances have led to increases in the types and amounts of data, and there is a great need for developing methods to manage and find relevant information from such data to satisfy users’ information needs. Learning to rank is an emerging discipline at the intersection of machine learning, data mining, and information retrieval. It develops principled machine learning algorithms to construct ranking (i.e., retrieval) models, for finding and ranking relevant information to user queries over large amounts of data. Although learning to rank approaches are capable of learning highly effective ranking functions, they have mostly ignored the important issue of model efficiency (i.e., model speed). Given that efficiency and effectiveness are competing forces that often counteract each other, models that are optimized for effectiveness alone may not meet the strict efficiency requirements when dealing with real-world large-scale datasets. My Ph.D. thesis introduces the Learning to Efficiently Rank framework for learning large-scale ranking models that facilitate fast and effective retrieval, by exploiting and optimizing the tradeoffs between model complexity (i.e., speed) and accuracy. At a basic level, this framework learns ranking models whose speed and accuracy can be explicitly controlled. I proposed and designed solutions for three problems within this framework: 1) learning large-scale ranking models according to a desired tradeoff between model speed and accuracy; 2) constructing temporally-constrained models capable of returning results under time budgets; 3) breaking through the speed/accuracy tradeoff barrier by developing a novel cascade ranking model, and learning the cascade model structure and parameters with a novel boosting-based learning algorithm. My research extends the conventional effectiveness-centric approach in model learning and takes an efficiency-minded look at building effective retrieval models.
Results show that models learned this way significantly outperform traditional machine-learned models in terms of speed without sacrificing result effectiveness. Moreover, the new models work particularly well when users impose stringent time requirements for ranked retrieval on very large datasets.

**Bio:** Lidan Wang is a Ph.D. candidate in the Computer Science Department at the University of Maryland, College Park. She received her Master’s degree from the Department of Computer Science at the University of Wisconsin, Madison, and Bachelor’s degree from the Department of Computer Science at the University of Florida. Lidan’s research interests lie at the intersection of machine learning, information retrieval, and text and data mining. Lidan’s work focuses on designing large-scale machine learning and information retrieval techniques for learning, mining, and retrieving information at scale. Her Ph.D. dissertation research led to a recent NSF research grant (IIS-1144034), which she co-authored.

**Note:** The Stieff Building is at 810 Wyman Park Drive, a half-mile’s walk south of the Homewood campus.

(an ML talk in the HLTCOE Speaker Series)

Fri 02/17/12, 12:00pm, Hackerman B17**Learning to Read the Web***Tom Mitchell, Carnegie Mellon University*

**Abstract:** We describe our efforts to build a Never-Ending Language Learner (NELL) that runs 24 hours per day, forever, learning to read the web. Each day NELL extracts (reads) more facts from the web, and integrates these into its growing knowledge base of beliefs. Each day NELL also learns to read better than yesterday, enabling it to go back to the text it read yesterday, and extract more facts, more accurately. NELL has now been running 24 hours/day for over two years. The result so far is a collection of 15 million interconnected beliefs (e.g., servedWith(coffee, applePie), isA(applePie, bakedGood)) that NELL is considering at different levels of confidence, along with hundreds of thousands of learned phrasings, morphological features, and web page structures that NELL uses to extract beliefs from the web. The approach implemented by NELL is based on three key ideas: (1) coupling the semi-supervised training of thousands of different functions that extract different types of information from different web sources, (2) automatically discovering new constraints that more tightly couple the training of these functions over time, and (3) a curriculum or sequence of increasingly difficult learning tasks. Track NELL’s progress at http://rtw.ml.cmu.edu.

**Bio:** Tom M. Mitchell is the E. Fredkin University Professor and founding head of the Machine Learning Department at Carnegie Mellon University. His research interests lie in machine learning, artificial intelligence, and cognitive neuroscience. Mitchell is a member of the U.S. National Academy of Engineering, a Fellow of the American Association for the Advancement of Science (AAAS), and a Fellow and Past President of the Association for the Advancement of Artificial Intelligence (AAAI). Mitchell believes the field of machine learning will be the fastest growing branch of computer science during the 21st century. His web page is http://www.cs.cmu.edu/~tom.

(an ML talk in the CLSP Speaker Series)

Thu 02/16/12, 03:45pm, Krieger Hall, room #134A**Neural Representations of Word Meanings***Tom Mitchell, Carnegie Mellon University*

**Abstract:** How does the human brain represent meanings of words and pictures in terms of neural activity? This talk will present our research addressing this question, in which we are applying machine learning algorithms to fMRI and MEG brain image data. One line of our research involves training classifiers that identify perceptual and semantic properties of a word a person reads, based on their observed neural activity. A second line involves training computational models that predict the neural activity associated with arbitrary English words, including words for which we do not yet have brain image data. A third line of work involves examining neural activity at millisecond time resolution during the comprehension of words and phrases.

**Bio:** Tom M. Mitchell is the E. Fredkin University Professor and head of the Machine Learning Department at Carnegie Mellon University. His research interests lie in cognitive neuroscience, machine learning, natural language processing, and artificial intelligence. Mitchell is a member of the US National Academy of Engineering, a Fellow of the American Association for the Advancement of Science (AAAS), and a Fellow and Past President of the Association for the Advancement of Artificial Intelligence (AAAI). Mitchell believes the field of machine learning will be the fastest growing branch of computer science during the 21st century. His home page is www.cs.cmu.edu/~tom.

**Note:** Refreshments at 3:30pm

(an ML talk in the CogSci Speaker Series)

Thu 02/16/12, 03:00pm, Gilman Hall 132**To Adapt or Not To Adapt: The Power and Limits of Adaptive Sensing***Mark A. Davenport, Stanford University*

**Abstract:** In recent years, the fields of signal processing, statistical inference, and machine learning have come under mounting pressure to accommodate massive amounts of increasingly high-dimensional data. Despite extraordinary advances in computational power, the data produced in application areas such as imaging, remote surveillance, meteorology, genomics, and large scale network analysis continues to pose a number of challenges. Fortunately, in many cases these high-dimensional signals contain relatively little information compared to their ambient dimensionality. For example, signals can often be well-approximated as sparse in a known basis, as a matrix having low rank, or using a low-dimensional manifold or parametric model. Exploiting this structure is critical to any effort to extract information from such data. In this talk I will overview some of my recent research on how to exploit such models to recover high-dimensional signals from as few observations as possible. Specifically, I will primarily focus on the problem of recovering a sparse vector from a small number of noisy measurements. To begin, I will consider the case where the measurements are acquired in a nonadaptive fashion. I will establish a lower bound on the minimax mean-squared error of the recovered vector which very nearly matches the performance of l1-minimization techniques, and hence shows that these techniques are essentially optimal. I will then consider the case where the measurements are acquired sequentially in an adaptive manner. I will prove a lower bound that shows that, surprisingly, adaptivity does not allow for substantial improvement over standard nonadaptive techniques in terms of the minimax MSE. Nonetheless, I will also show that there are important regimes where the benefits of adaptive sensing are clear and overwhelming.

**Bio:** Mark A. Davenport received the B.S.E.E., M.S., and Ph.D. degrees in electrical and computer engineering in 2004, 2007, and 2010, all from Rice University. In 2011 he was a visitor at the Laboratoire Jacques-Louis Lions, Université Pierre et Marie Curie. He is currently an NSF Mathematical Sciences Postdoctoral Research Fellow in the Department of Statistics at Stanford University. His research interests include compressive sensing, low-rank matrix recovery, nonlinear approximation, and the application of low-dimensional signal models in signal processing and machine learning. Dr. Davenport shared the Hershel M. Rich Invention Award from Rice in 2007 for his work on the single-pixel camera and compressive sensing. In 2011 he was awarded the Ralph Budd Thesis Award from Rice.

(an ML talk in the ECE Speaker Series)

Mon 02/13/12, 10:00am, Stieff Building, North Conference room**Interactive Machine Learning: Combining Learning Strategies with Humans in the Loop***Burr Settles, Carnegie Mellon University*

**Abstract:** People learn by interacting with their teachers. Why not machines? What would it take to develop software that can learn how to solve problems by interacting and collaborating with humans? This talk will describe my efforts to develop such systems, with the goal of training effective machine learners more quickly and economically. In particular, I focus on two projects in natural language processing that combine multiple learning strategies: incorporating domain knowledge (taking advice in the form of human-provided rules), active learning (asking “questions” of human annotators), and semi-supervised learning (attempting to “teach itself” by extrapolating what has been learned onto abundant, unlabeled data). Empirical results from user experiments show that these approaches are superior to their state-of-the-art “passive” learning counterparts. Interestingly, these experiments provide initial insights into human “teaching” behavior as well, suggesting ways in which human factors can and should be taken into account. I will also briefly discuss opportunities for interactive learning in other areas, such as supporting online communities, creative work, and biological discovery.

**Bio:** Burr Settles is a Postdoctoral Fellow in the Machine Learning Department at Carnegie Mellon University. He received a PhD in Computer Sciences from the University of Wisconsin-Madison in 2008, with additional studies in Linguistics and Biology. His current research focuses on interactive machine learning that resembles a “dialogue” of decision-making and knowledge acquisition between computers and humans, with applications in natural language processing, biology, and social computing. He recently organized workshops at the ICML and NAACL conferences on these topics, and is the author of a popular literature survey on active learning (active-learning.net). He also runs the website FAWM.ORG, prefers sandals to shoes, and plays guitar in the Pittsburgh pop band Delicious Pastries.

**Note:** The Stieff Building is at 810 Wyman Park Drive, a half-mile’s walk south of the Homewood campus.

(an ML talk in the HLTCOE Speaker Series)

Fri 02/03/12, 04:00pm, School of Public Health W2030**Missing Heritability: New Statistical and Algorithmic Approaches***Or Zuk, Broad Institute (MIT/Harvard)*

**Abstract:** The completion of the human genome project set a stepping stone in building catalogs of common human genetic variation. These catalogs, in turn, enabled the search for associations between common variants and complex human traits and diseases, by performing Genome-Wide Association Studies (GWAS). GWAS have been successful in discovering thousands of statistically significant, reproducible, genotype-phenotype associations. However, the discovered variants (genotypes) explain only a small fraction of the phenotypic variance in the population for most human traits. In contrast, the heritability, defined as the proportion of phenotypic variance explained by all genetic factors, was estimated to be much larger for those same traits using indirect population-based estimators. This gap is referred to as ‘missing heritability’. Mathematically, heritability is defined by considering a function $F$ mapping a set of (Boolean) variables $(x_1, \ldots, x_n)$, representing genotypes, and additional environmental or ‘noise’ variables $\epsilon$, to a single (real or discrete) variable $z$, representing phenotype. We use the variance decomposition of $F$, separating the linear term, corresponding to additive (narrow-sense) heritability, and higher-order terms, representing genetic interactions (epistasis), to explore several explanations for the ‘missing heritability’ mystery. We show that genetic interactions can significantly bias upwards current population-based heritability estimators, creating a false impression of ‘missing heritability’. We offer a solution to this problem by providing a novel consistent estimator based on unrelated individuals. We also use the Wright-Fisher process from population genetics theory to develop and apply a novel power correction method for inferring the relative contributions of rare and common variants to heritability.
Finally, we propose a novel algorithm for estimating the different variance components (beyond additive) of heritability from GWAS data.
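As a reading aid, the variance decomposition referred to in the abstract can be sketched as follows. This is the standard textbook form, written under the simplifying assumption that genotypes and environment are independent. The symbols $V_A$, $V_I$, $V_\epsilon$, $h^2$ and $H^2$ are notation introduced here, not taken from the talk.

$$
z = F(x_1, \ldots, x_n, \epsilon), \qquad
\mathrm{Var}(z) = \underbrace{V_A}_{\text{additive (linear) term}} + \underbrace{V_I}_{\text{interactions (epistasis)}} + \underbrace{V_\epsilon}_{\text{environment / noise}}
$$

$$
h^2 = \frac{V_A}{\mathrm{Var}(z)} \quad \text{(narrow-sense)}, \qquad
H^2 = \frac{V_A + V_I}{\mathrm{Var}(z)} \quad \text{(broad-sense)}
$$

In this notation, an estimator that targets $h^2$ but is inflated by $V_I$ would overstate the additive heritability, which is the kind of bias the abstract describes.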

**Bio:** Or Zuk is a postdoctoral researcher at the Broad Institute of MIT and Harvard, in Eric Lander’s group. Previously, he completed a Ph.D. in Computer Science and Applied Mathematics at the Weizmann Institute of Science under the supervision of Eytan Domany. His main research interests are in computational and statistical problems arising from applications in genomics and genetics.

(an ML talk in the Biostatistics Speaker Series)

Wed 02/01/12, 04:00pm, Room W2030 School of Public Health**Inference with Implicit Likelihoods for Infectious Disease Models***Roman Jandarov, Penn State University*

**Abstract:** Probabilistic models for infectious disease dynamics are useful for understanding the mechanism underlying the spread of infection. When the likelihood function for these models is expensive to evaluate, traditional likelihood-based inference may be computationally intractable. Furthermore, traditional inference may lead to poor parameter estimates and the fitted model may not capture important biological characteristics of the observed data. In this talk, I describe a novel approach for resolving these issues that is inspired by recent work in emulation and calibration for complex computer models. Using our motivating example, the gravity time series susceptible-infected-recovered (TSIR) model for measles dynamics, I demonstrate that the new approach is computationally expedient, provides accurate parameter inference, and results in a good model fit. The approach focuses on the characteristics of the process that are of scientific interest. We find a Gaussian process approximation to the gravity model using key summary statistics obtained from model simulations. The method is widely applicable to problems where traditional likelihood-based inference is computationally intractable or produces a poor model fit. It is also an alternative to approximate Bayesian computation (ABC) when simulations from the model are expensive. I will also discuss how our methodology is useful for inference in mixed membership random graph models for affiliation networks. At the end of the talk I will briefly describe two other projects I have worked on, one on modeling meningitis transmission and the other on estimating periodicities in gypsy moth outbreaks.

(an ML talk in the Biostatistics Speaker Series)

Tue 01/31/12, 12:00pm, Hackerman B17**Scalable Topic Models***David Blei, Princeton University*

**Abstract:** Probabilistic topic modeling provides a suite of tools for analyzing large collections of documents. Topic modeling algorithms can uncover the underlying themes of a collection and decompose its documents according to those themes. We can use topic models to explore the thematic structure of a corpus and to solve a variety of prediction problems about documents. At the center of a topic model is a hierarchical mixed-membership model, where each document exhibits a shared set of mixture components with individual (per-document) proportions. Our goal is to condition on the observed words of a collection and estimate the posterior distribution of the shared components and per-document proportions. When analyzing modern corpora, this amounts to posterior inference with billions of latent variables. How can we cope with such data? In this talk, I will describe stochastic variational inference, an algorithm for computing with topic models that can handle very large document collections and even endless streams of documents. I will demonstrate the algorithm with models fitted to millions of articles. I will show how stochastic variational inference can be generalized to many kinds of hierarchical models. I will highlight several open questions and outstanding issues. (This is joint work with Francis Bach, Matt Hoffman, John Paisley, and Chong Wang.)

**Bio:** David Blei is an associate professor of Computer Science at Princeton University. His research interests include probabilistic topic models, graphical models, approximate posterior inference, and Bayesian nonparametrics.

(an ML talk in the CLSP Speaker Series)

Tue 01/31/12, 10:45am, Hackerman B17**Algorithms for Learning Latent Variable Models***Daniel Hsu, Microsoft Research New England*

**Abstract:** Latent variable models are widely used in applications to automatically recover simple underlying signals from noisy high-dimensional data. The challenge in estimating such models stems from the presence of hidden (unobserved) variables, and typical local search methods used for this task (e.g., E-M) generally lack basic performance guarantees such as statistical consistency and computational efficiency. In this talk, I will discuss recent developments in linear algebraic methods for learning certain classes of latent variable models, including parameter estimation for hidden Markov models, and structure learning of latent variable tree models. Unlike the local search heuristics, the proposed linear algebraic methods come with statistical and computational efficiency guarantees under mild conditions on the data distribution. Central to the new techniques is the characterization of the models in terms of low-order moments (e.g., averages, correlations) of observable variables, which are readily estimated from data.

**Bio:** Daniel Hsu is a postdoctoral researcher at Microsoft Research New England. Previously, he was a postdoc with the Department of Statistics at Rutgers University and the Department of Statistics at the University of Pennsylvania from 2010 to 2011, supervised by Tong Zhang and Sham M. Kakade. He received his Ph.D. in Computer Science in 2010 from UC San Diego, where he was advised by Sanjoy Dasgupta, and his B.S. in Computer Science and Engineering in 2004 from UC Berkeley. His research interests are in algorithmic statistics and machine learning.

(an ML talk in the CS Speaker Series)

Wed 12/07/11, 04:00pm, Room W2030 School of Public Health**Bayesian Models for Mining Public Health Information from Twitter***Mark Dredze, Johns Hopkins University*

**Abstract:** Twitter and other social media sites contain a wealth of information about populations and have been used to track sentiment towards products, measure political attitudes, and study social linguistics. In this talk, we investigate the potential for Twitter to impact public health research. Specifically, we consider population surveillance, a major focus of public health that typically depends on clinical encounters with health professionals to collect patient data. Individual users often broadcast salient health information, such as “sick with this flu fever taking over my body ughhhh time for tylenol”, which indicates that not only does this person have the flu, but also a fever and is self-medicating with tylenol. Aggregating such content across millions of users could provide information about numerous aspects of illnesses in the population. In this work we present the Ailment Topic Aspect Model (ATAM), a new Bayesian graphical model for Twitter that associates symptoms, treatments and general words with diseases (ailments). When applied to 1.6 million health-related tweets, ATAM discovers descriptions of diseases in terms of collections of words (symptoms and treatments) and partitions messages based on the referenced disease. The model discovers diseases corresponding to influenza, infections, obesity, insomnia, and several others. Furthermore, we demonstrate the effectiveness of this model at several tasks: tracking illnesses over time (syndromic surveillance), measuring behavioral risk factors, localizing illnesses by geographic region, and analyzing symptoms and medication usage. We show quantitative correlations with public health data and qualitative evaluations of model output. Our results suggest that Twitter has broad applicability for public health research.

**Bio:** Mark Dredze is an Assistant Research Professor in Computer Science at Johns Hopkins University and a research scientist in the Human Language Technology Center of Excellence (HLTCOE). He is also affiliated with the Center for Language and Speech Processing (CLSP) and is part of the Machine Learning Group. His research in natural language processing and machine learning has focused on graphical models, semi-supervised learning, information extraction, large-scale learning, speech processing and health informatics. He obtained his PhD from the University of Pennsylvania in 2009.

**Note:** Refreshments at 3:30 PM

(an ML talk in the Biostatistics Speaker Series)

Thu 12/01/11, 01:30pm, Whitehead 304**A Non-Parametric Bayesian Approach to Inflectional Morphology***Jason Eisner, Johns Hopkins University*

**Abstract:** Have you ever studied a foreign language and had to memorize verb conjugations? How many regular and irregular verbs did you have to study before you could generalize the patterns to new verbs? Were you able to acquire more new verbs and new patterns by reading text in the foreign language? These are problems of statistical inference. You are inferring a distribution over vectors of related strings:

**Note:** Refreshments to follow at 2:30 PM

(an ML talk in the AMS Speaker Series)

Thu 11/17/11, 03:45pm, Krieger Hall, Room #111**A Unifying Account of Inductive Reasoning***Dr. Charles Kemp, CMU*

**Abstract:** People learn and reason about animals, spatial relations, kinsfolk, and many other domains, and solve a broad range of inductive problems within each of these domains. The full set of domains and the full set of inductive problems within these domains can be collectively described as the conceptual universe. I will present a systematic characterization of the conceptual universe that helps to clarify the relationships between familiar inductive problems such as property induction, categorization, and identification, and that introduces new inductive problems for psychological investigation. I will illustrate the framework using case studies that include behavioral and computational studies of inductive reasoning, and a computational analysis of kinship classification across cultures.

(an ML talk in the CogSci Speaker Series)

Thu 11/17/11, 01:30pm, Gilman Hall, Room 50**A Computationally Tractable Theory of Performance Analysis in Stochastic Systems***Dimitris Bertsimas, MIT*

**Abstract:** Modern probability theory, whose foundation is based on the axioms set forth by Kolmogorov, is currently the major tool for performance analysis in stochastic systems. While it offers insights in understanding such systems, probability theory is really not a computationally tractable theory. Correspondingly, some of its major areas of application remain unsolved when the underlying systems become multidimensional: queuing networks, network information theory, pricing multi-dimensional financial contracts, and auction design in multi-item, multi-bidder auctions, among others. We propose a new approach to analyze stochastic systems based on robust optimization. The key idea is to replace the Kolmogorov axioms as primitives of probability theory with some of the asymptotic implications of probability theory (the central limit theorem and the law of large numbers) and to define appropriate robust optimization problems for performance analysis. In this way, the performance analysis questions become highly structured optimization problems (linear, conic, mixed integer) for which there exist efficient, practical algorithms that are capable of solving truly large scale systems. We demonstrate that the proposed approach achieves computationally tractable methods for (a) analyzing multiclass queuing networks, (b) characterizing the capacity region of network information theory and associated coding and decoding methods generalizing the work of Shannon, (c) pricing multi-dimensional financial contracts generalizing the work of Black, Scholes and Merton, and (d) designing multi-item, multi-bidder auctions generalizing the work of Myerson. This is joint work with my doctoral student at MIT, Chaithanya Bandi.

**Bio:** Dimitris Bertsimas is currently the Boeing Professor of Operations Research and the Co-Director of the Operations Research Center at the Massachusetts Institute of Technology. He received a BS in Electrical Engineering and Computer Science at the National Technical University of Athens, Greece in 1985, an MS in Operations Research at MIT in 1987, and a Ph.D. in Applied Mathematics and Operations Research at MIT in 1988. Since 1988, he has been on the MIT faculty. His research interests include optimization, stochastic systems, data mining, and their application. In recent years he has worked in robust optimization, health care and finance. He has co-authored more than 150 scientific papers and the following books: Introduction to Linear Optimization (with J. Tsitsiklis, Athena Scientific and Dynamic Ideas, 2008), Data, Models and Decisions (with R. Freund, Dynamic Ideas, 2004), and Optimization over Integers (with R. Weismantel, Dynamic Ideas, 2005). He is currently department editor in Optimization for Management Science and former area editor in Operations Research in Financial Engineering. He has supervised 46 doctoral students and is currently supervising 12 others. He is a member of the National Academy of Engineering, and he has received numerous research awards including the Farkas prize (2008), the Erlang prize (1996), the SIAM prize in optimization (1996), the Bodossaki prize (1998) and the Presidential Young Investigator award (1991-1996). He was co-founder of Dynamic Ideas, which developed portfolio management tools for asset management. In 2002, the assets of Dynamic Ideas were sold to American Express. He is founder of Dynamic Ideas Press, a publisher of scientific books, and co-founder of Alpha Dynamics, an asset management company.

(an ML talk in the AMS Speaker Series)

Wed 11/16/11, 04:00pm, Room W2030 School of Public Health**Learning Discrete Graphical Model Structure***Pradeep Ravikumar, University of Texas, Austin*

**Abstract:** Undirected graphical models, also known as Markov random fields (MRFs), are widely used in a variety of domains, including biostatistics, natural language processing and image analysis among others. They compactly represent distributions over a large number of variables using undirected graphs which encode conditional independence assumptions among the variables. Recovering this underlying graph structure is thus important for many of these applications of MRFs, especially under constrained settings where the number of variables is large, and the samples are limited. In this talk, we will cover three recent approaches for recovering discrete graphical model structure. In the first, we investigate the use of sparse and group-sparse regularization, when combined with pseudo-likelihood-like approximations to the graphical model log-likelihood. In the second, we investigate the use of state-of-the-art variational approximations to the graphical model log-likelihood instead. In the third, we study the use of a simple greedy procedure that iteratively adds and deletes edges. We discuss conditions under which each of these methods can be guaranteed to succeed, with high probability, in recovering the underlying graph structure even under high-dimensional settings. Joint work with Ali Jalali, Christopher Johnson and Eunho Yang.

(an ML talk in the Biostatistics Speaker Series)

Tue 11/15/11, 04:30pm, Hackerman B17**Object Detection Grammars***David McAllester, Toyota Technological Institute at Chicago*

**Abstract:** As statistical methods came to dominate computer vision, speech recognition and machine translation there was a tendency toward shallow models. The late Fred Jelinek is famously quoted as saying that every time he fired a linguist the performance of his speech recognition system improved. A major challenge of modern statistical methods is to demonstrate that deep models can be made to perform better than shallow models. This talk will describe an object detection system which tied for first place in the 2008 and 2009 PASCAL VOC object detection challenge and won a PASCAL “lifetime achievement” award in 2010. The system exploits a grammar model for representing object appearance. This model seems “deeper” than those used in the previous generation of statistically trained object detectors. This object detection system and the associated grammar formalism will be described in detail and future directions discussed.

**Bio:** Professor McAllester received his B.S., M.S., and Ph.D. degrees from the Massachusetts Institute of Technology in 1978, 1979, and 1987 respectively. He served on the faculty of Cornell University for the academic year of 1987-1988 and served on the faculty of MIT from 1988 to 1995. He was a member of technical staff at AT&T Labs-Research from 1995 to 2002. Since 2002 he has been Chief Academic Officer at the Toyota Technological Institute at Chicago. He has been a fellow of the American Association for Artificial Intelligence (AAAI) since 1997. A 1988 paper on computer game algorithms influenced the design of the algorithms used in the Deep Blue system that defeated Garry Kasparov. A 1991 paper on AI planning proved to be one of the most influential papers of the decade in that area. A 1998 paper on machine learning theory introduced PAC-Bayesian theorems which combine Bayesian and non-Bayesian methods. A 2001 paper with Andrew Appel introduced the influential step-index model of recursive types. He is currently part of a team that scored in the top two places in the PASCAL object detection challenge (computer vision) in 2007, 2008 and 2009.

(an ML talk in the CLSP Speaker Series)

Tue 11/15/11, 01:00pm, Bloomberg 475**Automated Source Classification for the Synoptic Survey Era***Joey Richards, Berkeley*

**Abstract:** With the fast-approaching deluge of photometric data from synoptic surveys such as Gaia and LSST, there is an urgent need for methods that quickly and automatically classify newly-observed sources from a small number of light-curve measurements. Scientific discovery on such massive data streams is in no way guaranteed: these projects require sophisticated statistical tools for source detection, object classification, outlier detection, and optimal allocation of follow-up resources. In this talk, I will detail our current use of state-of-the-art machine learning methods to perform real-time discovery and event classification for the Palomar Transient Factory (PTF), where the challenge is to find the handful of real astrophysical transients out of 1.5 million nightly candidates. I will also describe how we construct catalogs of probabilistic source classifications using long-baseline retrospective light curves from surveys such as the All Sky Automated Survey (ASAS) and PTF. Finally, I will describe our efforts to overcome sample selection bias, where the distribution of labeled, well-understood objects is an inherently biased sample from the population of interest. In particular, I will show that active learning is a powerful tool to overcome sample selection bias, and will detail its use on a variety of astronomical data sets.

Fri 11/11/11, 12:00pm, Hackerman B17**Learning Semantic Parsers for More Languages and with Less Supervision***Luke Zettlemoyer, University of Washington*

**Abstract:** Recent work has demonstrated effective learning algorithms for a variety of semantic parsing problems, where the goal is to automatically recover the underlying meaning of input sentences. Although these algorithms can work well, there is still a large cost in annotating data and gathering other language-specific resources for each new application. This talk focuses on efforts to address these challenges by developing scalable, probabilistic CCG grammar induction algorithms. I will present recent work on methods that incorporate new notions of lexical generalization, thereby enabling effective learning for a variety of different natural languages and formal meaning representations. I will also describe a new approach for learning semantic parsers from conversational data, which does not require any manual annotation of sentence meaning. Finally, I will sketch future directions, including our recurring focus on building scalable learning techniques while attempting to minimize the application-specific engineering effort.

**Bio:** Luke Zettlemoyer is an Assistant Professor at the University of Washington. His research interests are in the intersections of natural language processing, machine learning and decision making under uncertainty. He spends much of his time developing learning algorithms that attempt to recover and make use of detailed representations of the meaning of natural language text. He was a postdoctoral research fellow at the University of Edinburgh and received his Ph.D. from MIT.

(an ML talk in the CLSP Speaker Series)

Thu 11/10/11, 01:30pm, Whitehead 304**Detecting Change in Multivariate Data Streams Using Minimum Subgraphs***Robert Koyak, Naval Postgraduate School*

**Abstract:** Consider a sequence of independent, multivariate observations on which we test whether their sampling distributions are the same (homogeneity) or change in a systematic manner (heterogeneity). Such change may consist of a single “jump” to a new distribution at an unspecified point (“time”) in the observation sequence or a gradual drift such that the metricized distributional change increases with time separation. A matrix of inter-point distances provides a rich store of information about the distributional structure of the observations. When used as edge weights in a complete, undirected graph with vertices consisting of the sequence labels, interesting possibilities arise for tapping into this information with minimum subgraphs of various kinds, including spanning trees and nonbipartite matchings. These subgraphs can yield nonparametric tests for multivariate homogeneity that are quite powerful. With the exception of minimum spanning trees, however, the theory behind these procedures is largely undeveloped. We begin by reviewing these procedures, and then discuss both computational and theoretical challenges to broadening their applicability.

(an ML talk in the AMS Speaker Series)

Wed 11/02/11, 12:00pm, Hackerman B17**Robots with Language: Solving Visual Scene Understanding Tasks***Cornelia Fermüller, University of Maryland*

**Abstract:** Robots with cognition interacting with humans need to create semantic descriptions of the environment they perceive. They need to recognize objects, actions, and events to take appropriate actions. Solving these complex tasks using perception alone is not feasible today, but will require the interaction of different cognitive processes. High level knowledge about the scene should be combined with perceptual recognition processes at different levels in the computation. I suggest that this can be achieved by using natural language to organize the semantic information. I will present a framework for organizing cognitive tasks of robots that implements the interaction of action, vision and language, and show implementations for the task of human activity recognition. In this approach language interacts with vision at different levels: to guide attention, to make predictions, and to reason over what is being perceived.

**Bio:** Cornelia Fermüller is Associate Research Scientist at the Computer Vision Laboratory of the Institute for Advanced Computer Studies, University of Maryland at College Park, where she leads the Cognitive Robotics group. She holds a Ph.D. from the Technical University of Vienna, Austria (1993) and an M.S. from the University of Technology, Graz, Austria (1989), both in Applied Mathematics and Computer Science. Prior to joining the University of Maryland in 1994, she held visiting research positions at the Computational Vision and Active Perception Laboratory, Royal Institute of Technology, Stockholm, Sweden (1994) and the Institute of Computer Science, FORTH, Heraklio, Greece (1993) and was Research Associate at the Institute for Image Processing and Computer Graphics, Joanneum Research, Graz, Austria (1989-90). Her research has been in the area of Computer and Human Vision. She is a member of the Editorial Board of the Image and Vision Computing Journal.

(an ML talk in the LCSR Speaker Series)

Tue 11/01/11, 10:45am, Hackerman B17**Perception, Action and the Information Knot that Ties Them***Stefano Soatto, UCLA*

**Abstract:** I will describe a notion of Information for the purpose of decision and control tasks, as opposed to data transmission and storage tasks implicit in Communication Theory. It is rooted in ideas of J. J. Gibson, and is specific to classes of tasks and nuisance factors affecting the data formation process. When such nuisances involve scaling and occlusion phenomena, as in most imaging modalities, the “Information Gap” between the maximal invariants and the minimal sufficient statistics can only be closed by exercising control on the sensing process. Thus, sensing, control and information are inextricably tied. This has consequences in the analysis and design of active sensing systems. I will show applications in vision-based control, navigation, 3-D reconstruction and rendering, as well as detection, localization, recognition and categorization of objects and scenes in live video.

**Bio:** Stefano Soatto is the founder and director of the UCLA Vision Lab (vision.ucla.edu). He received his Ph.D. in Control and Dynamical Systems from the California Institute of Technology in 1996; he joined UCLA in 2000 after being Assistant and then Associate Professor of Electrical and Biomedical Engineering at Washington University, Research Associate in Applied Sciences at Harvard University, and Assistant Professor in Mathematics and Computer Science at the University of Udine, Italy. He received his D.Ing. degree (highest honors) from the University of Padova, Italy in 1992. Dr. Soatto is the recipient of the David Marr Prize (with Y. Ma, J. Kosecka and S. Sastry) for work on Euclidean reconstruction and reprojection up to subgroups. He also received the Siemens Prize with the Outstanding Paper Award from the IEEE Computer Society for his work on optimal structure from motion (with R. Brockett). He received the National Science Foundation Career Award and the Okawa Foundation Grant. He is a Member of the Editorial Board of the International Journal of Computer Vision (IJCV), the International Journal of Mathematical Imaging and Vision (JMIV) and Foundations and Trends in Computer Graphics and Vision.

Mon 10/31/11, 01:30pm, Clark 110**Optimizing the Quantity/Quality Trade-off in Connectome Inference***Carey Priebe, Johns Hopkins University*

**Abstract:** We demonstrate a meaningful prospective power analysis for an (admittedly idealized) illustrative connectome inference task. Modeling neurons as vertices and synapses as edges in a simple random graph model, we optimize the trade-off between the number of (putative) edges identified and the accuracy of the edge identification procedure. We conclude that explicit analysis of the quantity/quality trade-off is imperative for optimal neuroscientific experimental design. In particular, identifying edges faster/more cheaply, but with more error, can yield superior inferential performance. This is joint work with Joshua Vogelstein (JHU AMS) and Davi Bock (HHMI Janelia).

Tue 10/25/11, 04:30pm, Hackerman B17**Sparse Models of Lexical Variation***Jacob Eisenstein, Carnegie Mellon University*

**Abstract:** Text analysis involves building predictive models and discovering latent structures in noisy and high-dimensional data. Document classes, latent topics, and author communities are often distinguished by a small number of trigger words or phrases — needles in a haystack of irrelevant features. In this talk, I describe generative and discriminative techniques for learning sparse models of lexical differences. First, I show how multi-task regression with structured sparsity can identify a small subset of words associated with a range of demographic attributes in social media, yielding new insights about the complex multivariate relationship between demographics and lexical choice. Second, I present SAGE, a novel approach to sparsity in generative models of text, in which we induce sparse deviations from background log probabilities. As a generative model, SAGE can be applied across a range of supervised and unsupervised applications, including classification, topic modeling, and latent variable models.

**Bio:** Jacob Eisenstein is a postdoctoral fellow in the Machine Learning Department at Carnegie Mellon University. His research focuses on machine learning for social media analysis, discourse, and non-verbal communication. Jacob completed his Ph.D. at MIT in 2008, winning the George M. Sprowls dissertation award. In January 2012, Jacob will join Georgia Tech as an Assistant Professor in the School of Interactive Computing.

(an ML talk in the CLSP Speaker Series)

Tue 10/25/11, 01:00pm, Clark 314**Matrix Splitting Methods for Bound-constrained Quadratic Programming and Linear Complementarity Problems***Daniel Robinson, Johns Hopkins University*

**Abstract:** I present two-phase matrix splitting methods for solving bound-constrained quadratic programs (BQPs) and linear complementarity problems (LCPs). The method for solving BQPs uses matrix splitting iterations to generate descent directions that drive convergence of the iterates and rapidly identify those variables that are active at the solution. The second phase uses this prediction to further refine the active set and to accelerate convergence. The method for solving LCPs combines matrix splitting iterations with a “natural” merit function. This combination allows one to prove convergence of the method and maintain excellent practical performance. Once again, a second subspace phase is used to accelerate convergence. I present numerical results for both algorithms on CUTEr test problems, randomly generated problems, and the pricing of American options.
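
For readers unfamiliar with matrix splitting, one classical instance is projected Gauss-Seidel, which minimizes 0.5·xᵀHx + cᵀx over box bounds by solving for one coordinate at a time and clipping it to its bounds. The sketch below is only an illustration of that textbook iteration, assuming a symmetric H with positive diagonal; it is not the two-phase method described in the talk, and the function name and arguments are made up for this example.

```python
def projected_gauss_seidel(H, c, lower, upper, sweeps=100):
    """Approximately minimize 0.5*x'Hx + c'x subject to lower <= x <= upper.

    Assumes H is symmetric with a strictly positive diagonal (hypothetical
    helper written for illustration, not taken from the talk).
    """
    n = len(c)
    # Start from zero, moved inside the bounds if necessary.
    x = [min(max(0.0, lower[i]), upper[i]) for i in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            # Gradient of row i without the diagonal term, using the newest values.
            r = c[i] + sum(H[i][j] * x[j] for j in range(n) if j != i)
            # Exact minimizer along coordinate i, clipped to its bounds.
            x[i] = min(max(-r / H[i][i], lower[i]), upper[i])
    return x


if __name__ == "__main__":
    H = [[4.0, 1.0], [1.0, 3.0]]
    c = [-1.0, -2.0]
    print(projected_gauss_seidel(H, c, [0.0, 0.0], [10.0, 10.0]))
    print(projected_gauss_seidel(H, c, [0.0, 0.0], [10.0, 0.5]))
```

In the second call the upper bound on the second variable is active at the solution, which is the kind of active-set information the abstract says the splitting iterations identify.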

**Bio:** Daniel Robinson received his Ph.D. from the Department of Mathematics and Statistics at the University of California, San Diego in 2007. In the summer of 2006, he worked for Northrop Grumman as a consultant with his advisor Philip Gill and developed algorithms for trajectory optimization problems. From 2007 to 2010, Daniel was a research assistant to Nick Gould at the University of Oxford, England, where he developed and implemented optimization algorithms for large-scale nonlinear and convex optimization. In 2010, Daniel was a postdoctoral fellow for Jorge Nocedal in the Industrial Engineering and Management Sciences Department at Northwestern University. Currently, Daniel is an Assistant Professor at Johns Hopkins University in the Department of Applied Mathematics and Statistics. His current research lies at the interface between Applied Linear Algebra, Operations Research, and Applied Mathematics. In particular, he is interested in the formulation and implementation of efficient algorithms for large-scale continuous optimization, machine learning, and linear complementarity problems.

(an ML talk in the CIS Speaker Series)

Thu 10/20/11, 03:00pm, Gilman 132**A Metric between Probability Distributions of Different Sizes***Mathukumalli Vidyasagar, University of Texas-Dallas*

**Abstract:** There are many ways to compare two probability distributions defined on a common set, for instance the total variation metric. However, in problems of reduced‐order modeling, one has to compare probability distributions on sets of different cardinality. In this talk a “Variation of Information” metric is defined for such a purpose, and the problem of optimal order reduction in this metric is also studied. It is shown that the problems of computing the metric as well as order reduction are both closely related to a problem in computer science known as bin‐packing with over‐stuffing.
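
For reference, the total variation metric mentioned above has a standard closed form for two distributions p and q on a common finite set of n outcomes. The notation below (p_i, q_i, and the event A) is introduced here for illustration and is not taken from the talk.

```latex
d_{\mathrm{TV}}(p, q)
  \;=\; \frac{1}{2} \sum_{i=1}^{n} \lvert p_i - q_i \rvert
  \;=\; \max_{A \subseteq \{1, \dots, n\}} \lvert p(A) - q(A) \rvert
```

Both expressions require p and q to live on the same set, which is exactly the restriction the talk sets out to remove.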

**Bio:** Mathukumalli Vidyasagar received the B.S., M.S. and Ph.D. degrees in electrical engineering from the University of Wisconsin in Madison, in 1965, 1967 and 1969 respectively. Between 1969 and 1989, he was a Professor of Electrical Engineering at various universities in the USA and Canada. In 1989 he returned to India as the Director of the newly created Centre for Artificial Intelligence and Robotics (CAIR), which he built up into a leading research laboratory with about 40 scientists. In 2000 he moved to the Indian private sector as an Executive Vice President of India’s largest software company, Tata Consultancy Services (TCS). He retired from TCS in 2009 at the age of 62, and joined the Erik Jonsson School of Engineering & Computer Science at the University of Texas at Dallas, as a Cecil & Ida Green Chair in Systems Biology Science. In March 2010 he was named the Founding Head of the newly created Bioengineering Department. His current research interests are in the application of stochastic processes and stochastic modeling to problems in computational biology, control systems and quantitative finance.

Tue 10/18/11, 01:00pm, Clark 314**Capturing Human Insight for Large-Scale Visual Learning***Kristen Grauman, University of Texas at Austin*

**Abstract:** How should visual recognition algorithms solicit and exploit human knowledge? Existing approaches often manage human supervision in haphazard ways, and only allow a narrow, one-way channel of input from the annotator to the system. We propose learning algorithms that steer human insight towards where it will have the most impact, and expand the manner in which recognition methods can assimilate that insight. I will present an approach to actively seek annotators’ input when training an object recognition system. Unlike traditional active learning methods, we target not only the example for which a label is most needed, but also the type of label (e.g., an image tag vs. full segmentation). To allow large-scale selection, we introduce novel randomized hashing algorithms that can rapidly identify uncertain points within massive unlabeled pools of data. Using these ideas, we have recently deployed a “live learning” system that autonomously refines its models by actively requesting crowd-sourced annotations on images crawled from the Web. It yields state-of-the-art accuracy on some of the most challenging categories in the PASCAL object detection benchmark. Finally, beyond “asking” the right questions, I will briefly describe how we can “listen” more deeply to annotators, learning implied cues about objects’ relative importance in images and videos. This talk describes work with Sudheendra Vijayanarasimhan, Prateek Jain, Sung Ju Hwang, and Yong Jae Lee.

**Bio:** Kristen Grauman is a Clare Boothe Luce Assistant Professor in the Department of Computer Science at the University of Texas at Austin. Her research in computer vision and machine learning focuses on visual search and object recognition. Before joining UT-Austin in 2007, she received her Ph.D. in the MIT EECS department in the Computer Science and Artificial Intelligence Laboratory, and her B.A. in Computer Science from Boston College. She is a Microsoft Research New Faculty Fellow, a recipient of a 2008 NSF CAREER award, and was named one of “AI’s Ten to Watch” by IEEE Intelligent Systems in 2010.

(an ML talk in the CIS Speaker Series)

Fri 10/14/11, 12:00pm, Hackerman B17**Probabilistic Hashing for Similarity Searching and Machine Learning on Large Datasets in High Dimensions***Ping Li, Cornell University*

**Abstract:** Many applications such as information retrieval make use of efficient (approximate) estimates of set similarity. A number of such estimates have been discussed in the literature: minwise hashing, random projections and compressed sensing. This talk presents an improvement: b-bit minwise hashing. An evaluation on large real-life datasets will show large gains in both space and time. In addition, we will characterize the improvement theoretically, and show that the theory matches the practice. More recently, we realized that (b-bit) minwise hashing can not only be used for similarity matching but also for machine learning. Applying logistic regression and SVMs to large datasets faces numerous practical challenges. As datasets become larger and larger, they take too long to load and may not fit in memory. Training and testing time can become an issue. Error analysis and exploratory data analysis are rarely performed on large datasets because it is too painful to run lots of what-if scenarios and explore lots of high-order interactions (pairwise, 3-way, etc.). The proposed method has been applied to two large datasets: a “smaller” dataset (24GB in 16M dimensions) and a “larger” dataset (200GB in 1B dimensions). Using a single desktop computer, the proposed method takes 3 seconds to train an SVM for the smaller dataset and 30 seconds for the larger dataset.
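
As a rough illustration of the idea, the sketch below estimates the Jaccard similarity of two sets from minwise hash signatures, first from the full hash values and then from only their lowest b bits. Everything in it is an assumption made for this example: the helper names, the use of `hashlib.blake2b` as a stand-in for a random permutation, and the simplified correction that treats non-matching minima as agreeing on their low b bits with probability 2^-b. The published b-bit estimator uses more careful constants, so this is a sketch of the principle, not the speaker's method.

```python
import hashlib


def _hash(seed, item):
    """Deterministic 64-bit hash of (seed, item); stands in for a random permutation."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes):
    """One minimum hash value per seed; `items` must be non-empty."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig1, sig2):
    """Fraction of positions where the full minima agree."""
    matches = sum(1 for u, v in zip(sig1, sig2) if u == v)
    return matches / len(sig1)


def estimate_jaccard_b_bit(sig1, sig2, b):
    """Estimate from the lowest b bits only, with a simplified collision correction."""
    mask = (1 << b) - 1
    matches = sum(1 for u, v in zip(sig1, sig2) if (u & mask) == (v & mask))
    raw = matches / len(sig1)
    # Simplifying assumption: unequal minima share their low b bits with
    # probability c = 2**-b, so E[raw] is roughly c + (1 - c) * J.
    c = 2.0 ** -b
    return (raw - c) / (1.0 - c)


if __name__ == "__main__":
    a = set(range(0, 300))
    b_set = set(range(200, 500))  # true Jaccard similarity = 100 / 500 = 0.2
    sig_a = minhash_signature(a, 512)
    sig_b = minhash_signature(b_set, 512)
    print(estimate_jaccard(sig_a, sig_b))
    print(estimate_jaccard_b_bit(sig_a, sig_b, b=2))
```

Storing b bits per hash instead of 64 is where the space savings in the abstract would come from; the price is a noisier estimate per hash, so more hashes are typically needed for the same accuracy.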

(an ML talk in the CLSP Speaker Series)

Tue 10/11/11, 10:30am, Levering, Great Hall**Open Science: The Promise and the Challenge***Michael Nielsen*

**Abstract:** The net is transforming many aspects of our society, from finance to friendship. And yet scientists, who helped create the net, are extremely conservative in how they use it. Although the net has great potential to transform science, most scientists remain stuck in a centuries-old system for the construction of knowledge. I will describe some leading-edge projects that show how online tools can radically change and improve science (using projects in Mathematics and Citizen Science as examples), and will then go on to discuss why these tools haven’t spread to all corners of science, and how we can change that.

**Bio:** Michael Nielsen is an author and an advocate of open science. His book about open science, Reinventing Discovery, will be published by Princeton University Press in October, 2011. Prior to his book, Michael was an internationally known scientist who helped pioneer the field of quantum computation. He co-authored the standard text in the field, and wrote more than 50 scientific papers, including invited contributions to Nature and Scientific American. His work on quantum teleportation was recognized in Science Magazine’s list of the Top Ten Breakthroughs of 1998. Michael was educated at the University of Queensland, and as a Fulbright Scholar at the University of New Mexico. He worked at Los Alamos National Laboratory, as the Richard Chace Tolman Prize Fellow at Caltech, was Foundation Professor of Quantum Information Science and a Federation Fellow at the University of Queensland, and a Senior Faculty Member at the Perimeter Institute for Theoretical Physics. In 2008, he gave up his tenured position to work full time on open science.

Wed 09/28/11, 04:00pm, Room W2030 School of Public Health**Personalized Medicine and Statistical Learning***Michael Kosorok, UNC Chapel Hill*

**Abstract:** Personalized medicine is an important and active area of clinical research. In this talk, we will systematically review recent publications in the area and outline the main scientific approaches and statistical issues at play. We will also describe some recent design and methodological developments in clinical trials for discovery and evaluation of personalized medicine. Statistical learning tools from artificial intelligence, including machine learning and reinforcement learning, are beginning to play increasingly important roles in these areas. We present several illustrative examples in treatment of depression, cancer and cystic fibrosis.

(an ML talk in the Biostatistics Speaker Series)

Tue 09/20/11, 04:30pm, Hackerman B17**When Topic Models Go Bad: Diagnosing and Improving Models for Exploring Large Corpora***Jordan Boyd-Graber, University of Maryland*

**Abstract:** Imagine you need to get the gist of what’s going on in a large text dataset such as all tweets that mention Obama, all e-mails sent within a company, or all newspaper articles published by the New York Times in the 1990s. Topic models, which automatically discover the themes that permeate a corpus, are a popular tool for discovering what’s being discussed. However, topic models aren’t perfect; errors hamper adoption of the model, performance in downstream computational tasks, and human understanding of the data. Fortunately, humans can easily diagnose and fix these errors. We describe crowdsourcing experiments to detect problematic topics and to determine which models produce comprehensible topics. Next, we present a statistically sound model to incorporate hints and suggestions from humans to iteratively refine topic models to better model large datasets. If time permits, we will also examine how topic models can be used to understand topic control in debates and discussions.

**Bio:** Jordan Boyd-Graber is an assistant professor in the College of Information Studies and the Institute for Advanced Computer Studies at the University of Maryland, focusing on the interaction of users and machine learning: how algorithms can better learn from human behaviors and how users can better communicate their needs to machine learning algorithms. Previously, he worked as a postdoc with Philip Resnik at the University of Maryland. Until 2009, he was a graduate student at Princeton University working with David Blei on linguistic extensions of topic models. His current work is supported by NSF, IARPA, and ARL.

(an ML talk in the CLSP Speaker Series)

Fri 09/16/11, 12:00pm, Hackerman B17**Short URLs, Big Data: Machine Learning at Bitly***Hilary Mason, Bit.ly*

**Abstract:** Bitly is a URL shortening service, gathering hundreds of millions of data points about the links people share every day. I’ll discuss the data analysis techniques that we use, giving examples of machine learning problems that we are solving at scale, and talk about the differences between industry, startup, and academic research.

**Bio:** Hilary Mason is the Chief Scientist at bit.ly, where she finds sense in vast data sets. Her work involves both pure research and development of product-focused features. She’s also a co-founder of HackNY (hackny.org), a non-profit organization that connects talented student hackers from around the world with startups in NYC. Hilary recently started the data science blog Dataists (dataists.com) and is a member of hacker collective NYC Resistor. She has discovered two new species, loves to bake cookies, and asks way too many questions.

(an ML talk in the CLSP Speaker Series)

Wed 09/07/11, 12:00pm, Hackerman B17**Micro and Nano Robotics and Applications in Health Care***Brad Nelson, ETH Zurich*

**Abstract:** Microrobotics has entered the phase in which sub-mm autonomous robots are being realized. While the potential impact of these devices on society is high, particularly for biomedical applications, many challenges remain in developing microrobots that will be useful to society. This talk will discuss possible applications of future microrobotic technologies in health care as well as approaches to the locomotion of microrobots in liquid and on solid surfaces. Issues in the design of external systems for providing energy and control of microrobots must be considered, and the use of externally generated magnetic fields in particular appears to be a promising strategy. Theoretical and experimental issues will be discussed, and functionalization of the devices and efforts to scale microrobots to the nanodomain will be presented.

**Bio:** Brad Nelson is the Professor of Robotics and Intelligent Systems at ETH Zürich. His primary research focus is on microrobotics and nanorobotics with an emphasis on applications in biology and medicine. He received a B.S.M.E. from the University of Illinois at Urbana-Champaign and an M.S.M.E. from the University of Minnesota. He has worked as an engineer at Honeywell and Motorola and served as a United States Peace Corps Volunteer in Botswana, Africa, before obtaining a Ph.D. in Robotics from Carnegie Mellon University in 1995. He was an Assistant Professor at the University of Illinois at Chicago (1995-1998) and an Associate Professor at the University of Minnesota (1998-2002). He became a Full Professor at ETH Zürich in 2002.

(an ML talk in the LCSR Speaker Series)

Tue 09/06/11, 04:30pm, Hackerman B17**Learning to Describe Images***Julia Hockenmaier, University of Illinois, Urbana-Champaign*

**Abstract:** How can we create an algorithm that learns to associate images with sentences in natural language that describe the situations depicted in them? This talk will describe ongoing research towards this goal, with a focus on the natural language understanding aspects. Although we believe that this task may benefit from improved object recognition and deeper linguistic analysis, we show that models that rely on simple perceptual cues of color, texture and local feature descriptors on the image side, and on sequence-based features on the text side, can do surprisingly well. We also demonstrate how to leverage the availability of multiple captions for the same image.

**Bio:** Julia Hockenmaier is assistant professor of computer science at the University of Illinois at Urbana-Champaign. She came to Illinois after a postdoc at the University of Pennsylvania and a PhD at the University of Edinburgh. She holds an NSF CAREER award.

(an ML talk in the CLSP Speaker Series)

Tue 09/06/11, 10:45am, Hackerman B17**Twenty Questions with Noisy Answers for Object Detection and Tracking***Raphael Sznitman, Johns Hopkins University*

**Abstract:** In the traditional “twenty questions” game, the task at hand is to determine a fact or target location by sequentially asking a knowledgeable oracle questions. This problem has been extensively studied in the past, and results on optimal questioning strategies are well understood. In this thesis, however, we consider the case where the answers from the oracle are corrupted with noise from a known model. With this problem occurring both in nature and in a number of computer vision applications (e.g., object detection and localization, tracking, image registration), the goal then is to determine some policy, or sequence of questions, that reduces the uncertainty on the target location as much as possible. We begin by presenting a Bayesian formulation of a simple and idealized parameter estimation problem. Starting with a prior distribution on the parameter, principles in dynamic programming and information theory can be used to characterize an optimal policy when minimizing the expected entropy of the distribution of the parameter. We then show the existence of a simple greedy policy that is globally optimal. Given these results, we describe a series of stochastic optimization algorithms that embody the noisy twenty questions game paradigm in the context of computer vision. These algorithms are referred to as Active Testing. We describe the benefit of using this technique in two real-world applications: (i) face detection and localization, and (ii) tool tracking during retinal microsurgery. In the first application, we show that substantial computational gains over existing approaches are achieved when localizing faces in images. In the second, we tackle a much more challenging real-world application where one must find the position and orientation of a surgical tool during surgery. Our approach provides a new and innovative way to perform fast and reliable tool tracking in cases when the tool moves in and out of the field of view often.
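
To make the noisy twenty questions setting concrete, here is a minimal sketch on a discrete grid of cells. It assumes a symmetric noise model (each yes/no answer is flipped with a known probability `eps`) and questions of the form "is the target left of this split?". The greedy rule picks the split whose left side holds posterior mass closest to one half, then applies Bayes' rule to the answer. All names are hypothetical, and this toy version is not the Active Testing algorithm from the thesis.

```python
import random


def best_split(belief):
    """Split index in 1..n-1 whose left-side posterior mass is closest to 1/2."""
    best, best_gap, mass = 1, float("inf"), 0.0
    for i in range(len(belief) - 1):
        mass += belief[i]
        gap = abs(mass - 0.5)
        if gap < best_gap:
            best, best_gap = i + 1, gap
    return best


def posterior_update(belief, split, answer_is_left, eps):
    """Bayes update after asking 'is the target in cells [0, split)?'."""
    post = []
    for i, p in enumerate(belief):
        in_left = i < split
        # The answer agrees with the truth with probability 1 - eps.
        likelihood = (1.0 - eps) if in_left == answer_is_left else eps
        post.append(p * likelihood)
    total = sum(post)
    return [p / total for p in post]


def locate(target, n, eps, num_questions, rng):
    """Simulate the game; return the most probable cell and the final belief."""
    belief = [1.0 / n] * n  # uniform prior over n >= 2 cells
    for _ in range(num_questions):
        split = best_split(belief)
        truth = target < split
        # The oracle lies with probability eps.
        answer = truth if rng.random() >= eps else not truth
        belief = posterior_update(belief, split, answer, eps)
    estimate = max(range(n), key=belief.__getitem__)
    return estimate, belief


if __name__ == "__main__":
    estimate, belief = locate(target=37, n=64, eps=0.1, num_questions=60,
                              rng=random.Random(0))
    print(estimate, round(belief[estimate], 3))
```

With `eps = 0` this reduces to ordinary bisection; with noise, the belief is never split into "possible" and "impossible" cells but is reweighted, so a wrong answer can be outvoted by later ones.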

**Bio:** Raphael Sznitman is a Ph.D. candidate in the Department of Computer Science at Johns Hopkins University, where he has worked under the supervision of Dr. Gregory Hager and Dr. Bruno Jedynak. He is a member of the Computational Interactions and Robotics Laboratory (CIRL) and the Laboratory for Computational Sensing and Robotics (LCSR). In 2007, he received his Bachelors of Science from the University of British Columbia (Vancouver, Canada), where he studied Cognitive Systems. He then joined the Computer Science department at JHU and received his Masters of Science in Engineering from Johns Hopkins University in 2008. His research interests lie in the topics of computer vision, object localization and detection, object tracking, stochastic optimization, and machine learning.

(an ML talk in the CS Speaker Series)

Fri 09/02/11, 12:00pm, Hackerman B17**Applications of Weighted Finite State Transducers in a Speech Recognition Toolkit***Daniel Povey, Microsoft*

**Abstract:** The open-source speech recognition toolkit “Kaldi” uses weighted finite state transducers (WFSTs) for training and decoding, and uses the OpenFst toolkit as a C++ library. I will give an informal overview of WFSTs and of the standard AT&T recipe for WFST-based decoding, and will mention some problems (in my opinion) with the basic recipe and how we addressed them while developing Kaldi. I will also describe how to use WFSTs to achieve “exact” lattice generation, in a sense that will be explained. This is an interesting application of WFSTs because, unlike most WFST mechanisms, it does not have any obvious non-WFST analog.

**Bio:** Daniel Povey received his Bachelor’s (Natural Sciences, 1997), Master’s (Computer Speech and Language Processing, 1998) and PhD (Engineering, 2003) from Cambridge University. He is currently a researcher at Microsoft Research, Redmond, Washington, USA. From 2003 to 2008 he worked as a researcher in IBM Research in Yorktown Heights, NY. He is best known for his work on discriminative training for HMM-GMM based speech recognition (i.e. MMI, MPE, and their feature-space variants).

(an ML talk in the CLSP Speaker Series)

Wed 08/10/11, 10:30am, Hackerman B17**Hierarchical Modeling and Prior Information: An Example From Toxicology***Andrew Gelman, Columbia University*

**Abstract:** We describe a general approach using Bayesian analysis for the estimation of parameters in physiological pharmacokinetic models. The chief statistical difficulty in estimation with these models is that any physiological model that is even approximately realistic will have a large number of parameters, often comparable to the number of observations in a typical pharmacokinetic experiment (e.g., 28 measurements and 15 parameters for each subject). In addition, the parameters are generally poorly identified, as in the well-known ill-conditioned problem of estimating a mixture of declining exponentials. Our modeling includes (a) hierarchical population modeling, which allows partial pooling of information among different experimental subjects; (b) a pharmacokinetic model including compartments for well-perfused tissues, poorly perfused tissues, fat, and the liver; and (c) informative prior distributions for population parameters, which is possible because the parameters represent real physiological variables. We discuss how to estimate the models using Bayesian posterior simulation, a method that automatically includes the uncertainty inherent in estimating such a large number of parameters. We also discuss how to check model fit and sensitivity to the prior distribution using posterior predictive simulation.

**Bio:** Andrew Gelman is a professor of statistics and political science and director of the Applied Statistics Center at Columbia University. He has received the Outstanding Statistical Application award from the American Statistical Association, the award for best article published in the American Political Science Review, and the Council of Presidents of Statistical Societies award for outstanding contributions by a person under the age of 40. His books include Bayesian Data Analysis (with John Carlin, Hal Stern, and Don Rubin), Teaching Statistics: A Bag of Tricks (with Deb Nolan), Data Analysis Using Regression and Multilevel/Hierarchical Models (with Jennifer Hill), and, most recently, Red State, Blue State, Rich State, Poor State: Why Americans Vote the Way They Do (with David Park, Boris Shor, Joe Bafumi, and Jeronimo Cortina).

(an ML talk in the CLSP Speaker Series)

Wed 07/27/11, 10:30am, Hackerman B17**Large Scale Supervised Embedding for Text and Images***Jason Weston, Google*

**Abstract:** In this talk I will present two related pieces of research for text retrieval and image annotation that both use supervised embedding algorithms over large datasets. Part 1: The first part of the talk presents a class of models that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score. Like latent semantic indexing (LSI), our models take account of correlations between words (synonymy, polysemy). However, unlike LSI, our models are trained with a supervised signal directly on the task of interest, which we argue is the reason for our superior results. We provide an empirical study on Wikipedia documents, using the links to define document-document or query-document pairs, where we beat several baselines. We also describe extensions to the nonlinear case and for dealing with huge dictionary sizes. (Joint work with Bing Bai, David Grangier and Ronan Collobert.) Part 2: Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a well-performing method that scales to such datasets by simultaneously learning to optimize precision at k of the ranked list of annotations for a given image and learning a low-dimensional joint embedding space for both images and annotations. Our method both outperforms several baseline methods and, in comparison to them, is faster and consumes less memory. We also demonstrate how our method learns an interpretable model, where annotations with alternate spellings or even languages are close in the embedding space. Hence, even when our model does not predict the exact annotation given by a human labeler, it often predicts similar annotations, a fact that we try to quantify by measuring the “sibling” precision metric, where our method also obtains good results. (Joint work with Samy Bengio and Nicolas Usunier.)
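
For readers unfamiliar with the "precision at k" metric named in Part 2, the following is a minimal sketch of how it is usually computed for one image. The function name, the example annotations, and the choice to divide by k (rather than by the number of items actually returned) are illustrative assumptions, not details taken from the talk.

```python
def precision_at_k(ranked_labels, relevant, k):
    """Fraction of the top-k ranked labels that are in the relevant set.

    ranked_labels: labels ordered from most to least confident (hypothetical data).
    relevant: set of ground-truth labels for the same item.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_labels[:k]
    hits = sum(1 for label in top_k if label in relevant)
    # Convention assumed here: always divide by k, even if fewer labels exist.
    return hits / k


# Hypothetical ranked annotations for a single image.
ranked = ["dolphin", "ocean", "car", "water", "tree"]
truth = {"dolphin", "water", "sea"}

p_at_3 = precision_at_k(ranked, truth, 3)  # one hit ("dolphin") in the top 3
p_at_5 = precision_at_k(ranked, truth, 5)  # two hits in the top 5
print(p_at_3, p_at_5)
```

Note that "ocean" counts as a miss here even though it is close in meaning to "sea"; this is the kind of near-miss that the "sibling" precision metric in the abstract is meant to credit.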

**Bio:** Jason Weston has been a Research Scientist at Google NY since July 2009. He earned his PhD in machine learning at Royal Holloway, University of London and at AT&T Research in Red Bank, NJ (advisor: Vladimir Vapnik) in 2000. From 2000 to 2002, he was a Researcher at Biowulf Technologies, New York. From 2002 to 2003 he was a Research Scientist at the Max Planck Institute for Biological Cybernetics, Tuebingen, Germany. From 2003 to June 2009 he was a Research Staff Member at NEC Labs America, Princeton. His interests lie in statistical machine learning and its application to text, audio and images. Jason has published over 80 papers, including best paper awards at ICML and ECML.

(an ML talk in the CLSP Speaker Series)

Wed 07/20/11, 10:30am, Hackerman B17**Distribution Fields for Low Level Vision***Erik Learned-Miller, University of Massachusetts, Amherst*

**Abstract:** Consider the following fundamental problem of low level vision: given a large image I and a patch J from another image, find the “best matching” location of the patch J to image I. We believe the solution to this problem can be significantly improved. A significantly better solution to this problem has the potential to improve a wide variety of low-level vision problems, such as backgrounding, tracking, medical image registration, optical flow, image stitching, and invariant feature definition. We introduce a set of techniques for solving this problem based upon a representation called distribution fields. Distribution fields are an attempt to take the best from a wide variety of low-level vision techniques including geometric blur (Berg), mixture of Gaussians backgrounding (Stauffer), SIFT (Lowe) and HoG (Dalal and Triggs), local color histograms, bilateral filtering, congealing (Learned-Miller) and many other techniques. We show how distribution fields solve this “patch” matching problem, and, in addition to finding the optimum match of patch J to image I with a high success rate, the algorithm produces, as a by-product, a very natural assessment of the quality of that match. We call this algorithm the “sharpening match”. Using the sharpening match for tracking yields an extremely simple but state-of-the-art tracker. We also discuss application of these techniques to background subtraction and other low level vision problems.

**Bio:** Erik G. Learned-Miller (previously Erik G. Miller) is an Associate Professor of Computer Science at the University of Massachusetts, Amherst, where he joined the faculty in 2004. He spent two years as a post-doctoral researcher at the University of California, Berkeley, in the Computer Science Division. Learned-Miller received a B.A. in Psychology from Yale University in 1988. In 1989, he co-founded CORITechs, Inc., where he and co-founder Rob Riker developed the second FDA-cleared system for image-guided neurosurgery. He worked for Nomos Corporation, Pittsburgh, PA, for two years as the manager of neurosurgical product engineering. He obtained Master of Science (1997) and Ph.D. (2002) degrees from the Massachusetts Institute of Technology, both in Electrical Engineering and Computer Science. In 2006, he received an NSF CAREER award for his work in computer vision and machine learning.

(an ML talk in the CLSP Speaker Series)

Tue 04/26/11, 04:30pm, Hackerman B17**Building Watson: An Overview of DeepQA for the Jeopardy! Challenge***David Ferrucci, IBM*

**Abstract:** Computer systems that can directly and accurately answer people’s questions over a broad domain of human knowledge have been envisioned by scientists and writers since the advent of computers themselves. Open domain question answering holds tremendous promise for facilitating informed decision making over vast volumes of natural language content. Applications in business intelligence, healthcare, customer support, enterprise knowledge management, social computing, science and government would all benefit from deep language processing. The DeepQA project is aimed at exploring how advancing and integrating Natural Language Processing (NLP), Information Retrieval (IR), Machine Learning (ML), massively parallel computation and Knowledge Representation and Reasoning (KR&R) can greatly advance open-domain automatic Question Answering. An exciting proof-point in this challenge is to develop a computer system that can successfully compete against top human players at the Jeopardy! quiz show (www.jeopardy.com). Attaining champion-level performance at Jeopardy! requires a computer system to rapidly and accurately answer rich open-domain questions, and to predict its own performance on any given category/question. The system must deliver high degrees of precision and confidence over a very broad range of knowledge and natural language content with a 3-second response time. To do this, DeepQA evidences and evaluates many competing hypotheses. A key to success is automatically learning and combining accurate confidences across an array of complex algorithms and over different dimensions of evidence. Accurate confidences are needed to know when to “buzz in” against your competitors and how much to bet. High precision and accurate confidence computations are just as critical for providing real value in business settings where helping users focus on the right content sooner and with greater confidence can make all the difference.
The need for speed and high precision demands a massively parallel computing platform capable of generating, evaluating and combining thousands of hypotheses and their associated evidence. In this talk I will introduce the audience to the Jeopardy! Challenge and how we tackled it using DeepQA.

**Bio:** Dr. David Ferrucci is the lead researcher and Principal Investigator (PI) for the Watson/Jeopardy! project. He has been a Research Staff Member at IBM’s T.J. Watson Research Center since 1995 where he heads up the Semantic Analysis and Integration department. Dr. Ferrucci focuses on technologies for automatically discovering valuable knowledge in natural language content and using it to enable better decision making. As part of his research he led the team that developed UIMA. UIMA is a software framework and open standard widely used by industry and academia for collaboratively integrating, deploying and scaling advanced text and multi-modal (e.g., speech, video) analytics. As chief software architect for UIMA, Dr. Ferrucci led its design and chaired the UIMA standards committee at OASIS. The UIMA software framework is deployed in IBM products and has been contributed to Apache open-source to facilitate broader adoption and development. In 2007, Dr. Ferrucci took on the Jeopardy! Challenge – tasked to create a computer system that can rival human champions at the game of Jeopardy!. As the PI for the exploratory research project dubbed DeepQA, he focused on advancing automatic, open-domain question answering using massively parallel evidence-based hypothesis generation and evaluation. By building on UIMA, on key university collaborations and by taking bold research, engineering and management steps, he led his team to integrate and advance many search, NLP and semantic technologies to deliver results that have out-performed all expectations and have demonstrated world-class performance at a task previously thought insurmountable with the current state-of-the-art. Watson, the computer system built by Ferrucci’s team, is now competing with top Jeopardy! champions.
Under his leadership they have already begun to demonstrate how DeepQA can make dramatic advances for intelligent decision support in areas including medicine, finance, publishing, government and law. Dr. Ferrucci has been the Principal Investigator (PI) on several government-funded research programs on automatic question answering, intelligent systems and scalable text analytics. His team at IBM consists of 28 researchers and software engineers specializing in the areas of Natural Language Processing (NLP), Software Architecture, Information Retrieval, Machine Learning and Knowledge Representation and Reasoning (KR&R). Dr. Ferrucci graduated from Manhattan College with a BS in Biology and from Rensselaer Polytechnic Institute in 1994 with a PhD in Computer Science specializing in knowledge representation and reasoning. He is published in the areas of AI, KR&R, NLP and automatic question-answering.

(an ML talk in the CLSP Speaker Series)

Tue 04/26/11, 01:00pm, Clark 314**Compressive Sensing for Computer Vision***Rama Chellappa, University of Maryland*

**Abstract:** The emerging theory of compressive sensing has immense implications for designing novel computer vision algorithms and systems. In this talk, I will discuss the basic theory behind compressive sensing and present several examples from computer vision. Specifically, I will discuss two algorithms and systems for generating fast video sequences from standard video sequences for periodic and general motion, 2D/3D reconstructions from sparse gradients, background subtraction, video synthesis and classification using compressive measurements and compressive SAR.

**Bio:** Prof. Rama Chellappa received the B.E. (Hons.) degree from the University of Madras, India, in 1975 and the M.E. (Distinction) degree from Indian Institute of Science, Bangalore, in 1977. He received M.S.E.E. and Ph.D. degrees in Electrical Engineering from Purdue University, West Lafayette, IN, in 1978 and 1981, respectively. Since 1991, he has been a Professor of Electrical Engineering and an affiliate Professor of Computer Science at University of Maryland, College Park. He is also affiliated with the Center for Automation Research (Director) and the Institute for Advanced Computer Studies (Permanent Member). In 2005, he was named a Minta Martin Professor of Engineering.

(an ML talk in the CIS Speaker Series)

Tue 04/19/11, 01:00pm, Clark 314**New Methods for Fast, Large, & Accurate Sequence Analytics***Vladimir Pavlovic, Rutgers University*

**Abstract:** Analysis of sequences, such as sequence classification or clustering, is a challenging problem that spans a number of fields in science and technology. For example, elucidation of protein functions from primary DNA sequence, “barcoding” of living species, understanding of text or music corpora are among the tasks that, in their core, rely on one’s ability to accomplish classification of strings of varying lengths and content. In this talk I will present my group’s recent work that aims to address the string classification problem in a simple, robust yet computationally highly efficient manner. I will discuss several new methods for sequence embedding into “Euclidean” spaces, including the Sparse Spatial Sample kernel (SSS). The SSS is astoundingly simple in its representation, yet the subspace it determines appears to very closely match the sequence manifolds for many practical problems. The SSS exemplifies a new family of spectral algorithms for fast string matching, which generalize many known sequence similarity measures. This new family of linear time algorithms improves theoretical complexity bounds of existing approaches while scaling well with respect to the sequence alphabet size, the number of allowed mismatches and the size of the dataset. In particular, on large alphabets with loose mismatch constraints our algorithms are several orders of magnitude faster than the existing state-of-the-art. I will then present a number of results on music genre classification, text classification, protein remote homology and fold prediction, and DNA barcoding that illustrate some of the approach’s benefits on a rather disparate set of problems. Time permitting, I will also discuss a new parametric sequence model, the Ordinal Hidden Markov chain, which highlights some new directions in sequence modeling.

**Bio:** Vladimir Pavlovic is an Associate Professor in the Computer Science Department at Rutgers University. He received the PhD in electrical engineering from the University of Illinois in Urbana-Champaign in 1999. Vladimir’s research interests include probabilistic system modeling, time-series analysis, statistical computer vision and bioinformatics. He has published over 80 peer-reviewed papers in major computer vision, machine learning and pattern recognition journals and conferences.

(an ML talk in the CIS Speaker Series)

Fri 04/15/11, 12:00pm, Mergenthaler 111**Training a Computer to See People***Deva Ramanan, University of California, Irvine*

**Abstract:** One of the great, open challenges in machine vision is to train a computer to “see people.” A reliable solution opens up tremendous possibilities, from automated persistent surveillance and next-generation image search, to more intuitive computer interfaces. It is difficult to analyze people, and objects in general, because their appearance can vary due to a variety of “nuisance” factors (including viewpoint, body pose, and clothing) and because real-world images contain clutter. I will describe machine learning algorithms that accomplish such tasks by encoding image statistics of the visual world learned from large-scale training data. I will focus on predictive models that produce rich, structured descriptions of images and videos (How many people are present? What are they doing?) and models that compensate for nuisance factors through the use of latent variables. I will illustrate such approaches for the tasks of object detection, people tracking, and activity recognition, producing state-of-the-art systems as evidenced by recent benchmark competitions.

**Bio:** Deva Ramanan is an assistant professor of Computer Science and the co-director of the Computational Vision Lab at the University of California at Irvine. Prior to joining UCI, he was a Research Assistant Professor at the Toyota Technological Institute at Chicago (2005-2007). He also held visiting researcher positions in the Robotics Institute at Carnegie Mellon University in 2006 and Microsoft Research in 2008. He received his B.S. degree with distinction in computer engineering from the University of Delaware in 2000, graduating summa cum laude. He received his Ph.D. in Electrical Engineering and Computer Science with a Designated Emphasis in Communication, Computation, and Statistics from UC Berkeley in 2005. His research interests span computer vision, machine learning, and computer graphics, with a focus on the application of understanding people through images and video. His past work focused on articulated tracking, while recent work has focused on object recognition. His work in this area won or received special recognition at the PASCAL Visual Object Class Challenge, 2007-2010, including a Lifetime Achievement Prize in 2010. His work on contextual object modeling won the 2009 David Marr prize. He was awarded an NSF Career Award in 2010. His work is supported by NSF, ONR, DARPA, as well as industrial collaborations with the Intel Science and Technology Center for Visual Computing, Google Research, and Microsoft Research. He serves on the editorial board of the International Journal of Computer Vision (IJCV), is a senior program committee member for the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), and has served on multiple NSF panels for computer vision and machine learning.

(an ML talk in the LCSR Speaker Series)

Wed 04/13/11, 01:00pm, Clark 314**Extracting Wiring Diagrams From Brain: The First 10 Teravoxels***Davi Bock, Janelia Farm Research Campus*

**Abstract:** The connections made by cortical brain cells are anatomically nanoscopic, yet each cell in the cortex has several centimeters of local anatomical ‘wiring’. This wiring packs the cortical volume essentially completely. We recently characterized the in vivo responses of a group of cells in mouse visual cortex, then imaged a volume of brain containing the cells using a custom-built high throughput electron microscopy (EM) camera array. Each voxel in the resulting data set occupies about 4 x 4 x 45 nanometers of brain; the 10 teravoxel volume spans 450 x 350 x 50 micrometers. The imaged volume is of sufficient size and resolution that we were able to trace the local connectivity of the physiologically characterized cells. One can therefore record what cells in the brain are doing, then trace their connectivity, a combination which could enable a new level of understanding of cortical circuits to be achieved. However, issues of data scale will have to be overcome for this potential to be realized. This talk will introduce the biological domain and the technologies involved, and will lay out some of the hurdles that lie ahead.
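
As a quick sanity check on the scale quoted in the abstract, the sketch below recomputes the voxel count from the stated voxel size and volume extent. Only the two sets of dimensions come from the abstract; the variable and function names are illustrative.

```python
# Back-of-the-envelope check of the voxel count quoted in the abstract.
NM_PER_UM = 1000  # nanometers per micrometer

voxel_nm = (4, 4, 45)        # voxel size from the abstract, in nanometers
volume_um = (450, 350, 50)   # imaged volume from the abstract, in micrometers


def total_voxels(volume_um, voxel_nm):
    """Multiply the per-axis voxel counts (extent divided by voxel size)."""
    total = 1.0
    for extent_um, size_nm in zip(volume_um, voxel_nm):
        total *= extent_um * NM_PER_UM / size_nm
    return total


teravoxels = total_voxels(volume_um, voxel_nm) / 1e12
print(f"{teravoxels:.1f} teravoxels")  # about 10.9, consistent with "10 teravoxel"
```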

(an ML talk in the CIS Speaker Series)

Tue 04/05/11, 10:45am, Hackerman B17**Deep Semantics from Shallow Supervision***Percy Liang, University of California, Berkeley*

**Abstract:** What is the total population of the ten largest capitals in the US? Building a system to answer free-form questions such as this requires modeling the deep semantics of language. But to develop practical, scalable systems, we want to avoid the costly manual annotation of these deep semantic structures and instead learn from just surface-level supervision, e.g., question/answer pairs. To this end, we develop a new tree-based semantic representation which has favorable linguistic and computational properties, along with an algorithm that induces this hidden representation. Using our approach, we obtain significantly higher accuracy on the task of question answering compared to existing state-of-the-art methods, despite using less supervision.

**Bio:** Percy Liang obtained a B.S. (2004) and an M.S. (2005) from MIT and is now completing his Ph.D. at UC Berkeley with Michael Jordan and Dan Klein. The general theme of his research, which spans machine learning and natural language processing, is learning richly-structured statistical models from limited supervision. He won a best student paper award at the International Conference on Machine Learning in 2008, received the NSF, GAANN, and NDSEG fellowships, and is also a 2010 Siebel Scholar.

(an ML talk in the CS Speaker Series)

Thu 03/17/11, 10:45am, Hackerman B17**Discovery and Prediction from Clinical Temporal Data***Suchi Saria, Stanford University*

**Abstract:** Physiological data are routinely recorded in intensive care, but their use for rapid assessment of illness severity has been limited. The data are high-dimensional, noisy, and change rapidly; moreover, small changes that occur in a patient’s physiology over long periods of time are difficult to detect, yet can lead to catastrophic outcomes. A physician’s ability to recognize complex patterns across these high-dimensional measurements is limited. We propose a nonparametric Bayesian method for discovering informative representations in such continuous time series that aids both exploratory data analysis and feature construction. When applied to data from premature infants in the neonatal ICU (NICU), our model obtains novel clinical insights. Based on these insights, we devised the Physiscore, a novel risk prediction score that combines patterns from continuous physiological signals to predict infants at risk for developing major complications in the NICU. Using only 3 hours of non-invasive data from birth, Physiscore very successfully predicts morbidity in preterm infants. Physiscore performed consistently better than other neonatal scoring systems, including the Apgar, which is the current standard of care, and SNAP, a machine learning based score that requires multiple invasive tests. This work was recently published on the cover of Science Translational Medicine (Science’s new journal aimed at translational medicine work), and was covered by numerous press sources.

**Bio:** Suchi Saria is finishing her PhD in Computer Science at Stanford under Daphne Koller. Her main research interest lies in machine learning and data-driven optimizations for health care. Her work has appeared in popular press sources such as CBS Radio, Science NOW, KCBS and the San Francisco Chronicle. Her work has also received a Best Student Paper award and a Best Student Paper finalist award. She is a recipient of the Stanford Graduate Fellowship (SGF) and two Microsoft full-tuition scholarships.

(an ML talk in the CS Speaker Series)

Tue 03/15/11, 01:00pm, Clark 314**Landmark-Dependent Hierarchical Beta Process for Robust Sparse Factor Analysis***Lawrence Carin, Duke University*

**Abstract:** A landmark-dependent hierarchical beta process is developed as a prior for data with associated covariates. The landmarks define local regions in the covariate space where feature usages are likely to be similar. The landmark locations, to which the data are linked through normalized kernels, are learned. To demonstrate unique aspects of the proposed model, we consider two applications: (i) denoising of an image contaminated by a superposition of Gaussian and spiky noise, and (ii) topic and spiky-keyword discovery from a document corpus. State-of-the-art performance is demonstrated, with efficient inference using hybrid Gibbs, Metropolis-Hastings and slice sampling.

**Bio:** Lawrence Carin earned the BS, MS, and PhD degrees in electrical engineering at the University of Maryland, College Park, in 1985, 1986, and 1989, respectively. In 1989 he joined the Electrical Engineering Department at Polytechnic University (Brooklyn) as an Assistant Professor, and became an Associate Professor there in 1994. In September 1995 he joined the Electrical Engineering Department at Duke University, where he is now the William H. Younger Professor of Engineering. He is a co-founder of Signal Innovations Group, Inc. (SIG), a small business, where he serves as the Director of Technology. His current research interests include signal processing, sensing, and machine learning. He has published over 200 peer-reviewed papers, and he is an IEEE Fellow.

(an ML talk in the CIS Speaker Series)

Tue 03/15/11, 10:45am, Hackerman B17**Non-Commutative Harmonic Analysis in Machine Learning***Risi Kondor, Caltech*

**Abstract:** Non-commutative harmonic analysis generalizes the notion of Fourier transformation to rotations, permutations, or indeed, any compact group of transformations acting on some underlying space. We have found that this theory and the corresponding fast Fourier transforms have a whole range of natural applications in machine learning and other areas of computer science. I will give an overview of these developments, touching on invariant features for computer vision, compact representations of uncertainty in multi-object tracking, graph kernels, and strategies for solving hard optimization problems.
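
For orientation, the generalization mentioned in the abstract can be written down explicitly in the simplest setting. The formulas below are the standard textbook definitions for a finite group $G$ (the compact case replaces the sum with an integral over Haar measure); the notation is an assumption of this note, not taken from the talk. Here $\rho$ ranges over the irreducible unitary representations of $G$ and $d_\rho$ is the dimension of $\rho$.

```latex
\hat{f}(\rho) = \sum_{g \in G} f(g)\,\rho(g),
\qquad
f(g) = \frac{1}{|G|} \sum_{\rho} d_\rho \,
       \operatorname{tr}\!\left[\hat{f}(\rho)\,\rho(g)^{-1}\right]
```

When $G$ is the cyclic group $\mathbb{Z}_n$, every irreducible representation is one-dimensional, $\rho_k(x) = e^{-2\pi i k x / n}$, and the first formula reduces to the ordinary discrete Fourier transform. For non-commutative groups such as the symmetric group, the coefficients $\hat{f}(\rho)$ are matrices rather than scalars.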

**Bio:** Risi Kondor obtained his B.A. in Mathematics from the University of Cambridge. After some further studies in Physics and Computational Fluid Dynamics, he changed direction to Machine Learning, getting his M.Sc. from CALD (the precursor to the Machine Learning Department) at Carnegie Mellon University in 2002, and his Ph.D. from Tony Jebara’s group at Columbia University in 2007. His first post-doc took him to London, where he worked at the Gatsby Computational Neuroscience Unit at UCL, and he is now completing his second post-doc at the Center for the Mathematics of Information at Caltech.

(an ML talk in the CS Speaker Series)

Tue 03/08/11, 10:45am, Hackerman B17**Scaling Up Probabilistic Inference***Vibhav Gogate, University of Washington*

**Abstract:** Graphical models and their logic-based extensions such as Markov logic have become the central paradigm for representation and reasoning in machine learning, artificial intelligence and computer science. They have led to many successful applications in domains such as bioinformatics, data mining, computer vision, social networks, entity resolution, natural language processing, and hardware and software verification. For them to be useful in these and other future applications, we need access to scalable probabilistic inference systems that are able to accurately analyze and model large amounts of data and make future predictions. Unfortunately, this is hard because exact inference crosses the #P boundary and is computationally intractable. Therefore, in practice, one has to resort to approximate algorithms and hope that they are accurate and scalable. In this talk, I’ll describe my research on scaling up approximate probabilistic inference algorithms to unprecedented levels. These algorithms are based on exploiting structural features present in the problem. I’ll show that all approximate inference schemes developed to date have completely ignored or under-utilized structural features such as context-specific independence, determinism and logical dependencies that are prevalent in many real-world problems. I’ll show how to change this by developing theoretically well-founded structure-aware algorithms that are simple yet very effective in practice. In particular, I’ll present results from the recently held 2010 UAI approximate inference challenge in which my schemes won several categories, outperforming the competition by an order of magnitude on the hardest problems. I’ll conclude by describing exciting opportunities that lie ahead in the area of graphical models and statistical relational learning.

**Bio:** Vibhav Gogate completed his Ph.D. at the University of California, Irvine in 2009 and is now a Post Doctoral Research Associate at the University of Washington. His research interests are in machine learning and artificial intelligence with a focus on graphical models and statistical relational learning. He has authored over 15 papers that have appeared in high-profile conferences and journals and is the co-winner of the 2010 Uncertainty in Artificial Intelligence (UAI) approximate inference challenge.

(an ML talk in the CS Speaker Series)

Tue 03/01/11, 01:00pm, Clark 314**Rank/Sparsity Minimization and Latent Variable Graphical Model Selection***Pablo Parrilo, MIT*

**Abstract:** Suppose we have a Gaussian graphical model with sample observations of only a subset of the variables. Can we separate the extra correlations induced due to marginalization over the unobserved, hidden variables from the structure among the observed variables? In other words, is it still possible to consistently perform model selection despite the latent variables? As we shall see, the key issue here is to decompose the concentration matrix of the observed variables into a sparse matrix (representing graphical model structure among the observed variables) and a low-rank matrix (representing the effects of marginalization over the hidden variables). This estimator is given by a tractable convex program, and it consistently estimates model structure in the high-dimensional regime in which the number of observed/hidden variables grows with the number of samples of the observed variables. In our analysis the algebraic varieties of sparse matrices and low-rank matrices play an important role. Joint work with Venkat Chandrasekaran and Alan Willsky (MIT).
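
The sparse-plus-low-rank structure described in the abstract follows from a Schur complement identity, sketched below. The notation is an assumption of this note: $K$ is the concentration (inverse covariance) matrix of the full model, partitioned into observed variables $O$ and hidden variables $H$. The convex program shown second is one common way to write such an estimator, with $\ell_1$ and trace penalties standing in for sparsity and rank; the exact formulation presented in the talk is not stated in the abstract.

```latex
% Concentration matrix of the observed marginal (Schur complement):
\tilde{K}_O \;=\; \underbrace{K_O}_{\text{sparse}}
  \;-\; \underbrace{K_{O,H}\,K_H^{-1}\,K_{H,O}}_{\text{rank} \,\le\, |H|}

% A regularized maximum-likelihood estimator of the two pieces,
% where \Sigma_O^n is the sample covariance of the observed variables
% and \ell(K;\Sigma) = \log\det K - \operatorname{tr}(K\Sigma):
(\hat{S}, \hat{L}) \;=\; \arg\min_{S,\,L}\;
  -\ell\!\left(S - L;\ \Sigma_O^n\right)
  + \lambda_n \left( \gamma \,\|S\|_1 + \operatorname{tr}(L) \right)
\quad \text{s.t.} \quad S - L \succ 0,\ \ L \succeq 0
```

The first term of the Schur complement encodes conditional dependencies among the observed variables given the hidden ones, so it is sparse whenever that conditional graph is sparse; the second term has rank at most the number of hidden variables, so it is low-rank when there are few of them.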

**Bio:** Pablo A. Parrilo received an Electronics Engineering undergraduate degree from the University of Buenos Aires, and a Ph.D. in Control and Dynamical Systems from the California Institute of Technology. He is currently a Professor at the Department of Electrical Engineering and Computer Science of the Massachusetts Institute of Technology, where he is also affiliated with the Laboratory for Information and Decision Systems (LIDS) and the Operations Research Center (ORC). His research interests include optimization and game theory methods for engineering applications, control and identification of uncertain complex systems, robustness analysis and synthesis, and the development and application of computational tools based on convex optimization and algorithmic algebra to practically relevant engineering problems.

(an ML talk in the CIS Speaker Series)

Tue 03/01/11, 10:45am, Hackerman B17**Learning Hierarchical Generative Models***Ruslan Salakhutdinov, MIT*

**Abstract:** Building intelligent systems that are capable of extracting meaningful representations from high-dimensional data lies at the core of solving many Artificial Intelligence tasks, including visual object recognition, information retrieval, speech perception, and language understanding. My research aims to discover such representations by learning rich generative models which contain deep hierarchical structure and which support inferences at multiple levels. In this talk, I will introduce a broad class of probabilistic generative models called Deep Boltzmann Machines (DBMs), and a new algorithm for learning these models that uses variational methods and Markov chain Monte Carlo. I will show that DBMs can learn useful hierarchical representations from large volumes of high-dimensional data, and that they can be successfully applied in many domains, including information retrieval, object recognition, and nonlinear dimensionality reduction. I will then describe a new class of more complex probabilistic graphical models that combine Deep Boltzmann Machines with structured hierarchical Bayesian models. I will show how these models can learn a deep hierarchical structure for sharing knowledge across hundreds of visual categories, which allows accurate learning of novel visual concepts from few examples.

**Bio:** Ruslan Salakhutdinov received his PhD in computer science from the University of Toronto in 2009, and he is now a postdoctoral associate at CSAIL and the Department of Brain and Cognitive Sciences at MIT. His research interests lie in machine learning, computational statistics, and large-scale optimization. He is the recipient of the NSERC Postdoctoral Fellowship and Canada Graduate Scholarship.

(an ML talk in the CS Speaker Series)

Thu 02/24/11, 10:45am, Hackerman B17**Get Another Label? Improving Data Quality and Machine Learning using Multiple, Noisy Labelers***Panos Ipeirotis, New York University*

**Abstract:** I will discuss the repeated acquisition of “labels” for data items when the labeling is imperfect. Labels are values provided by humans for specified variables on data items, such as “PG-13” for “Adult Content Rating on this Web Page.” With the increasing popularity of micro-outsourcing systems, such as Amazon’s Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. We present repeated-labeling strategies of increasing complexity, and show several main results: (i) Repeated-labeling can improve label quality and model quality (per unit data-acquisition cost), but not always. (ii) Simple strategies can give considerable advantage, and carefully selecting a chosen set of points for labeling does even better (we present and evaluate several techniques). (iii) Labeler (worker) quality can be estimated on the fly (e.g., to determine compensation, control quality or eliminate Mechanical Turk spammers) and systematic biases can be corrected. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data modelers should have in their repertoire. I illustrate the results with a real-life application from on-line advertising: using Mechanical Turk to help classify web pages as being objectionable to advertisers. This is joint work with Foster Provost, Victor S. Sheng, and Jing Wang. An earlier version of the work received the Best Paper Award Runner-up at the ACM SIGKDD Conference.

**Bio:** Panos Ipeirotis is an Associate Professor at the Department of Information, Operations, and Management Sciences at Leonard N. Stern School of Business of New York University. His recent research interests focus on crowdsourcing and on mining user-generated content on the Internet. He received his Ph.D. degree in Computer Science from Columbia University in 2004, with distinction. He has received two “Best Paper” awards (IEEE ICDE 2005, ACM SIGMOD 2006), two “Best Paper Runner Up” awards (JCDL 2002, ACM KDD 2008), and is also a recipient of a CAREER award from the National Science Foundation.

(an ML talk in the CS Speaker Series)

Tue 02/15/11, 10:45am, Hackerman B17**Understanding the World with Infinite Models and Finite Computation***Ryan Adams, University of Toronto and Canadian Institute for Advanced Research*

**Abstract:** We are undergoing a revolution in data. As computer scientists, we have grown accustomed to constant upheaval in computing resources — quicker processors, bigger storage and faster networks — but this century presents the new challenge of almost unlimited access to raw information. Whether from sensor networks, social computing or high-throughput cell biology, we face a deluge of data about our world. We need to parse this information, to understand it, to use it to make better decisions. In this talk, I will discuss my work to confront this new challenge, developing new machine learning algorithms that are based on infinitely-large probabilistic graphical models. In principle, these infinite representations allow us to analyze sophisticated and dynamic phenomena in a way that automatically balances simplicity and complexity — a mathematical Occam’s Razor. Our computers, however, are inevitably finite, so how can we use such tools in practice? I will discuss how my approach leverages ideas from mathematical statistics to develop practical algorithms for inference in infinite models with finite computation. I will discuss how combining a firm theoretical footing with practical computational concerns gives us tools that are useful both within computer science and beyond, in domains such as computer vision, computational neuroscience, biology and the social sciences.

**Bio:** Ryan Adams is a Junior Research Fellow in the University of Toronto Department of Computer Science, affiliated with the Canadian Institute for Advanced Research. He received his Ph.D. in Physics from Cambridge University, where he was a Gates Cambridge Scholar under Prof. David MacKay. Ryan grew up in Texas, but completed his undergraduate work in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology. He has received several awards for his research, including Best Paper at the 13th International Conference on Artificial Intelligence and Statistics.

(an ML talk in the CS Speaker Series)

Mon 02/14/11, 10:00am, COE (Stieff Building), North Conference room**Markov Logic in Machine Reading***Hoifung Poon, University of Washington*

**Abstract:** A long-standing goal of AI and natural language processing is to harness human knowledge by automatically understanding texts. Known as machine reading, it has become increasingly urgent with the rise of billions of web documents. To represent the acquired knowledge that is complex and heterogeneous, we need first-order logic. To handle the inherent uncertainty and ambiguity in extracting and reasoning with knowledge, we need probability. Combining the two has led to rapid progress in the emerging field of statistical relational learning. In this talk, I will show that statistical relational learning offers promising solutions for machine reading. I will present Markov logic, which is a leading unifying framework for statistical relational learning, and has spawned a number of successful applications for machine reading. In particular, I will present USP, an end-to-end machine reading system that can read text, extract knowledge and answer questions, all without any labeled examples. To resolve linguistic variations for the same meaning, USP recursively clusters expressions that are composed with or by similar expressions. In a machine reading experiment, USP extracted five times as many correct answers as state-of-the-art systems such as TextRunner, and raised accuracy from below 60% to 91%.

**Bio:** Hoifung Poon is a final-year Ph.D. student at the University of Washington, working with Pedro Domingos. His main research interest lies in advancing machine learning methods to handle both complexity and uncertainty, and in applying them to solving challenging natural language processing problems with few labeled examples. His most recent work developed unsupervised learning methods for a number of NLP problems ranging from morphological segmentation to machine reading, and received the Best Paper Awards in NAACL and EMNLP.

(an ML talk in the HLTCOE Speaker Series)

Tue 02/08/11, 04:30pm, Hackerman B17**A Scalable Distributed Syntactic, Semantic and Lexical Language Model***Shaojun Wang, Wright State University*

**Abstract:** In this talk, I’ll present an attempt at building a large scale distributed composite language model that is formed by seamlessly integrating n-gram, structured language model and probabilistic latent semantic analysis under a directed Markov random field paradigm to simultaneously account for local word lexical information, mid-range sentence syntactic structure, and long-span document semantic content. The composite language model has been trained by performing a convergent N-best list approximate EM algorithm and a follow-up EM algorithm to improve word prediction power on corpora with up to a billion tokens and stored on a supercomputer. The large scale distributed composite language model gives drastic perplexity reduction over n-grams and achieves significantly better translation quality measured by the BLEU score and “readability” of translations when applied to the task of re-ranking the N-best list from a state-of-the-art parsing-based machine translation system.

**Bio:** Shaojun Wang received his B.S. and M.S. in Electrical Engineering at Tsinghua University in 1988 and 1992 respectively, and his M.S. in Mathematics and Ph.D. in Electrical Engineering at the University of Illinois at Urbana-Champaign in 1998 and 2001 respectively. From 2001 to 2005, he worked at CMU, Waterloo and University of Alberta as a post-doctoral fellow. He joined the Department of Computer Science and Engineering at Wright State University as an assistant professor in 2006. His research interests are statistical machine learning, natural language processing, and cloud computing. He is now mainly focusing on two projects, large scale distributed language modeling and semi-supervised discriminative structured prediction, which are funded by NSF, Google and AFOSR. Both emphasize scalability and parallel/distributed approaches to processing extremely large scale datasets.

(an ML talk in the CLSP Speaker Series)

Tue 12/14/10, 01:00pm, Clark 314**Efficient Additive Kernels via Explicit Feature Maps***Andrea Vedaldi, University of Oxford*

**Abstract:** Most state-of-the-art visual object detectors make use of kernel methods to learn the appearance of objects, which depends in complex ways on intra-class variations, viewpoint and illumination changes, and clutter. Unfortunately the resulting learning problems are usually very large and can be solved efficiently only for simple models such as the linear support vector machines (SVMs). We introduce a technique to extend the efficient training of linear SVMs to the much broader class of additive homogeneous kernel SVMs (including the χ², intersection, and Jensen-Shannon kernels). Similar to the Nyström approximation, we express the homogeneous kernels as linear ones by transforming the data through a feature map. Compared to the general case, however, working with homogeneous kernels is much more efficient: our feature maps are available in closed form, are fast to compute and low dimensional, and the approximation error can be fully characterized. We demonstrate that the approximations have indistinguishable performance from the full kernels on a number of standard datasets, yet greatly reduce the train/test times of SVMs. In particular, we show that the χ² kernel, which has been found to yield the best performance in most applications, also has the most compact feature representation.

**Bio:** Andrea Vedaldi received the BSc degree (with honors) from the Information Engineering Department, University of Padua, Italy, in 2003 and the MSc and PhD degrees from the Computer Science Department, University of California at Los Angeles, in 2005 and 2008. Since 2008 he has been a research fellow at the University of Oxford, UK. His research interests include detection and recognition of visual object categories, visual representations, and large scale machine learning applied to computer vision. He is the recipient of the “Outstanding Doctor of Philosophy in Computer Science” and “Outstanding Master of Science in Computer Science” awards of the University of California at Los Angeles.

(an ML talk in the CIS Speaker Series)

Tue 11/23/10, 10:45am, Hackerman B17**Machine Learning and Multiagent Reasoning: From Robot Soccer to Autonomous Traffic***Peter Stone, University of Texas at Austin*

**Abstract:** One goal of Artificial Intelligence is to enable the creation of robust, fully autonomous agents that can coexist with us in the real world. Such agents will need to be able to learn, both in order to correct and circumvent their inevitable imperfections, and to keep up with a dynamically changing world. They will also need to be able to interact with one another, whether they share common goals, they pursue independent goals, or their goals are in direct conflict. This talk will present current research directions in machine learning, multiagent reasoning, and robotics, and will advocate their unification within concrete application domains. Ideally, new theoretical results in each separate area will inform practical implementations while innovations from concrete multiagent applications will drive new theoretical pursuits, and together these synergistic research approaches will lead us towards the goal of fully autonomous agents.

**Bio:** Dr. Peter Stone is an Alfred P. Sloan Research Fellow, Guggenheim Fellow, Fulbright Scholar, and Associate Professor in the Department of Computer Sciences at the University of Texas at Austin. He received his Ph.D. in Computer Science in 1998 from Carnegie Mellon University. From 1999 to 2002 he was a Senior Technical Staff Member in the Artificial Intelligence Principles Research Department at AT&T Labs – Research. Peter’s research interests include machine learning, multiagent systems, robotics, and e-commerce. In 2003, he won a CAREER award from the National Science Foundation for his research on learning agents in dynamic, collaborative, and adversarial multiagent environments. In 2004, he was named an ONR Young Investigator for his research on machine learning on physical robots. In 2007, he was awarded the prestigious IJCAI 2007 Computers and Thought award, given once every two years to the top AI researcher under the age of 35.

(an ML talk in the CS Speaker Series)

Fri 11/05/10, 09:00am, Clark 110**A Non-Parametric Model for the Discovery of Inflectional Paradigms from Plain Text using Graphical Models over Strings***Markus Dreyer, Johns Hopkins University*

**Abstract:** Statistical natural language processing can be difficult for morphologically rich languages. The observed vocabularies of such languages are very large, since each word may have been inflected for morphological properties like person, number, gender, tense, or others. This unfortunately masks important generalizations, leads to problems with data sparseness and makes it hard to generate correctly inflected text. This thesis tackles the problem of inflectional morphology with a novel, unified statistical approach. We present a generative probability model that can be used to learn from plain text how the words of a language are inflected, given some minimal supervision. In other words, we discover the inflectional paradigms that are implicit, or hidden, in a large unannotated text corpus. This model consists of several components: a hierarchical Dirichlet process clusters word tokens of the corpus into lexemes and their inflections, and graphical models over strings — a novel graphical-model variant — model the interactions of multiple morphologically related type spellings, using weighted finite-state transducers as potential functions. We present the components of this model, from weighted finite-state transducers parameterized as log-linear models, to graphical models over multiple strings, to the final non-parametric model over a corpus, its lexemes, inflections, and paradigms. We show experimental results for several tasks along the way, including a lemmatization task in multiple languages and, to demonstrate that parts of our model are applicable outside of morphology as well, a transliteration task. Finally, we show that learning from large unannotated text corpora under our non-parametric model significantly improves the quality of predicted word inflections.

**Bio:** Markus Dreyer is a Ph.D. student in the Computer Science Department at Johns Hopkins University, working with Jason Eisner. He is a member of the Center for Language and Speech Processing and the Human Language Technology Center of Excellence. Before coming to Johns Hopkins, he obtained a Magister degree at the University of Heidelberg and worked in the IBM Speech Research Group in Germany.

Thu 11/04/10, 10:45am, Hackerman B17**Estimating Ultra-large Phylogenies and Alignments***Tandy Warnow, University of Texas at Austin*

**Abstract:** Biomolecular sequences evolve under processes that include substitutions, insertions and deletions (indels), as well as other events, such as duplications. The estimation of evolutionary history from sequences is then used to answer fundamental questions about biology, and also has applications in a wide range of biomedical research. From a computational perspective, however, phylogenetic (evolutionary) tree estimation is enormously hard: all favored approaches are NP-hard, and even the best heuristics can take months or years on only moderately large datasets. Furthermore, while there are very good heuristics for estimating trees from sequences that are already placed in a multiple alignment (a step that is used when sequences evolve with indels), errors in alignment estimation produce errors in tree estimation, and the standard alignment estimation methods fail to produce highly accurate alignments on large highly divergent datasets. Thus, the estimation of highly accurate phylogenetic trees from large datasets of unaligned sequences is beyond the scope of standard methods. In this talk, I will describe new algorithmic tools that my group has developed, and which make it possible, for the first time, to obtain highly accurate estimates of trees from very large datasets, even when the sequences have evolved under high rates of substitution and indels. In particular, I will describe SATé (Liu et al. 2009, Science Vol 324, no. 5934). SATé simultaneously estimates a tree and alignment; our study shows that SATé is very fast and produces dramatically more accurate trees and alignments than competing methods, even on datasets with 1000 taxa and high rates of indels and substitutions. I will also describe our new method, DACTAL (not yet submitted).
DACTAL stands for “Divide-and-Conquer Trees without Alignments”, and uses an iterative procedure combined with a novel divide-and-conquer strategy to estimate trees from unaligned sequences. Our study, using both real and simulated data, shows that DACTAL produces trees of higher accuracy than SATé, and does so without ever constructing an alignment on the entire set of sequences. Furthermore, DACTAL is extremely fast, producing in a few days highly accurate estimates on datasets that take many other methods years. Time permitting, I will show how DACTAL can be used to improve the speed and accuracy of other phylogeny reconstruction methods, and in particular in the context of phylogenetic analyses of whole genomes.

**Bio:** Tandy Warnow is David Bruton Jr. Centennial Professor of Computer Sciences at the University of Texas at Austin. Her research combines mathematics, computer science, and statistics to develop improved models and algorithms for reconstructing complex and large-scale evolutionary histories in both biology and historical linguistics. Tandy received her PhD in Mathematics at UC Berkeley under the direction of Gene Lawler, and did postdoctoral training with Simon Tavare and Michael Waterman at USC. She received the National Science Foundation Young Investigator Award in 1994, the David and Lucile Packard Foundation Award in Science and Engineering in 1996, a Radcliffe Institute Fellowship in 2006, and a Guggenheim Foundation Fellowship for 2011. Tandy is a member of five graduate programs at the University of Texas, including Computer Science; Ecology, Evolution, and Behavior; Molecular and Cellular Biology; Mathematics; and Computational and Applied Mathematics. Her current research focuses on phylogeny and alignment estimation for very large datasets (10,000 to 500,000 sequences), estimating species trees from collections of gene trees, and genome rearrangement phylogeny estimation.

Thu 09/23/10, 01:30pm, Whitehead 304**Manifold Matching: Joint Optimization of Fidelity and Commensurability***Carey E. Priebe, Johns Hopkins University*

**Abstract:** Fusion and inference from multiple and massive disparate data sources – the requirement for our most challenging data analysis problems and the goal of our most ambitious statistical pattern recognition methodologies – has many and varied aspects which are currently the target of intense research and development. One aspect of the overall challenge is manifold matching – identifying embeddings of multiple disparate data spaces into the same low-dimensional space where joint inference can be pursued. We investigate this manifold matching task from the perspective of jointly optimizing the fidelity of the embeddings and their commensurability with one another, with a specific statistical inference exploitation task in mind. Our results demonstrate when and why our joint optimization methodology is superior to either version of separate optimization. The methodology is illustrated with simulations and an application in document matching.

(an ML talk in the AMS Speaker Series)

Tue 09/07/10, 04:30pm, Hackerman B17**Lifted Message Passing***Kristian Kersting, University of Bonn*

**Abstract:** Many AI inference problems arising in a wide variety of fields such as network communication, activity recognition, computer vision, machine learning, and robotics can be solved using message-passing algorithms that operate on factor graphs. Often, however, we are facing inference problems with symmetries not reflected in the factor graph structure and, hence, not exploitable by efficient message-passing algorithms. In this talk, I will survey lifted message-passing algorithms that exploit additional symmetries. Starting from a given factor graph, they essentially first construct a lifted factor graph of supernodes and superfactors, corresponding to sets of nodes and factors that send and receive the same messages, i.e., that are indistinguishable given the evidence. Then they run a modified message-passing algorithm on the lifted factor. In particular, I will present lifted variants of loopy and Gaussian belief propagation as well as warning and survey propagation, and demonstrate that significant efficiency gains are obtainable, often by orders of magnitude. This talk is based on collaborations with Babak Ahmadi, Youssef El Massaoudi, Fabian Hadiji, Sriraam Natarajan, and Scott Sanner.

**Bio:** Kristian Kersting is the head of the “statistical relational activity mining” (STREAM) group at Fraunhofer IAIS, Bonn, Germany, a research fellow of the University of Bonn, Germany, and a research affiliate of the Massachusetts Institute of Technology (MIT), USA. He received his Ph.D. from the University of Freiburg, Germany, in 2006. After a PostDoc at MIT, he joined Fraunhofer IAIS in 2008 to build up the STREAM research group using an ATTRACT Fellowship. His main research interests are statistical relational reasoning and learning (SRL), acting under uncertainty, and robotics. He has published over 60 peer-reviewed papers, has received the ECML Best Student Paper Award in 2006 and the ECCAI Dissertation Award 2006 for the best European dissertation in the field of AI, and is an ERCIM Cor Baayen Award 2009 finalist for the “Most Promising Young Researcher In Europe in Computer Science and Applied Mathematics”. He gave several tutorials at top conferences (AAAI, ECML-PKDD, ICAPS, IDA, ICML, ILP) and co-chaired MLG-07, SRL-09 and the recent AAAI-10 workshop on Statistical Relational AI (StarAI-10). He (will) serve(d) as area chair for ECML-06, ECML-07, ICML-10, as Senior PC member at IJCAO-10, and on the PC of several top conferences (IJCAI, AAAI, ICML, KDD, RSS, ECAI, ECML/PKDD, ICML, ILP,…). He was a guest co-editor for special issues of the Annals of Mathematics and AI (AMAI), the Journal of Machine Learning Research (JMLR), and the Machine Learning Journal (MLJ). Currently, he serves on the editorial board of the Machine Learning Journal (MLJ) and the Journal of Artificial Intelligence Research (JAIR).

(an ML talk in the CLSP Speaker Series)

Wed 06/23/10, 10:30am, Hackerman B17**Human Activity Recognition Using Simple Direct Sensors***Henry Kautz, University of Rochester*

**Abstract:** Simple, inexpensive sensors, including RFID-based object touch sensors, GPS location sensors, and cell-phone quality accelerometers, can be used to detect and distinguish a wide range of human activities with surprisingly high accuracy. I will provide an overview of hardware and algorithms for direct sensing, and speculate about how such sensor data could be used for embodied language and task learning.

**Bio:** Henry Kautz is Chair of the Department of Computer Science at the University of Rochester. He performs research in knowledge representation, satisfiability testing, pervasive computing, and assistive technology. He was a Professor in the Department of Computer Science and Engineering of the University of Washington from 2000 to 2006, after a career at Bell Labs and AT&T Laboratories, where he was Head of the AI Principles Research Department. His academic degrees include an A.B. in mathematics from Cornell University, an M.A. in Creative Writing from the Johns Hopkins University, an M.Sc. in Computer Science from the University of Toronto, and a Ph.D. in computer science from the University of Rochester. He is a Fellow of the American Association for the Advancement of Science, President (2010-2012) of the Association for the Advancement of Artificial Intelligence, Fellow of the AAAI, and a recipient of the IJCAI Computers and Thought Award.

(an ML talk in the CLSP Speaker Series)

Wed 05/12/10, 04:00pm, Clark 110**Genes, Networks and Disease***Joel Bader, Johns Hopkins University*

**Abstract:** Identifying disease-related genetic variants and their downstream effects on gene and protein pathways is having great impact in biology and medicine. Reverse engineering these pathways could lead to methods that control cellular behavior for therapeutic and bioengineering applications. This talk describes new methods that harness biological intuition to improve the computational search for disease-related genes and to predict their effects on dynamic cellular networks. Standard methods for identifying disease-related variants do not take advantage of our knowledge of genes. We have formulated new methods that use Bayesian model selection to perform tests gene-by-gene rather than SNP-by-SNP. Applied to cardiac phenotypes, these gene-based tests find more disease-related genes than SNP-based tests and state-of-the-art L1 regularization. We provide systematic evidence supporting the hypothesis that disease-risk genes have multiple independent risk-enhancing variants. To analyze dynamic processes in disease and cellular development, we have developed dynamic network models that integrate biological interaction data, including epistatic and regulatory interactions between genes and physical interactions between proteins, with time-dependent readouts from mRNA profiling. The computational realization is closely related to stochastic block models for social networks, spectral clustering for image segmentation, and graph diffusion for network search. Results reveal that dynamically changing protein complexes are built on static cores, and that human-curated pathways systematically separate genes with close functional associations. Applied to a neural developmental disorder, these methods implicate an unsuspected metabolite as a control point in neural stem cell development.

Thu 04/29/10, 04:30pm, COE (Stieff Building), North Conference room**Natural Language Processing in Multiple Domains: Linking the Unknown to the Known***John Blitzer, UC Berkeley*

**Abstract:** The key to creating scalable, robust natural language processing (NLP) systems is to exploit correspondences between known and unknown linguistic structure. Natural language processing has experienced tremendous success over the past two decades, but our most successful systems are still limited to the domains and languages where we have large amounts of hand-annotated data. Unfortunately, these domains and languages represent a tiny portion of the total linguistic data in the world. No matter the task, we always encounter unknown linguistic features like words and syntactic constituents that we have never observed before when estimating our models. This talk is about linking these unknown linguistic features to ones we already know for more robust learning across domains and languages. The first part of the talk describes a technique to link lexical items across domains for better sentiment analysis systems. These systems predict the general attitude of an essay toward a particular topic, but words which are highly predictive in one domain may not be present in another. We show how to build a correspondence representation between words in different domains using projections to low-dimensional, real-valued spaces. Unknown words are projected onto this representation and related directly to known features via Euclidean distance. Our correspondence representation allows us to train significantly more robust models in new domains, and we achieve a 40% reduction in error due to adaptation over a state-of-the-art system. The second part of the talk describes a technique to link syntactic constituents across languages for better syntactic machine translation. Syntactic machine translation models depend crucially on syntactic parsers that are accurate on bilingual text (text which consists of sentence pairs in multiple languages). 
Unfortunately, modern parsers perform significantly worse on bilingual text than on the controlled newswire corpora on which they’re evaluated. Our model adapts parsers to bilingual text by linking known and unknown syntactic features across languages through a latent correspondence grammar. This grammar significantly improves parser performance on parallel text, as well as improving a downstream Chinese-English machine translation system.

**Bio:** John Blitzer is a postdoctoral fellow in the computer science department at the University of California, Berkeley, working with Dan Klein. He completed his PhD in computer science at the University of Pennsylvania under Fernando Pereira. John’s research focuses on applications of machine learning to natural language. In particular, he is interested in exploiting unlabeled data and other sources of side information to improve supervised models. He has applied these techniques to tagging, parsing, entity recognition, web search, and machine translation. To learn more about John’s research interests, please visit his web page: http://john.blitzer.com

(an ML talk in the HLTCOE Speaker Series)

Thu 04/29/10, 01:30pm, Whitehead 304**Information-Theoretic Validation of L1 Penalized Likelihood***Andrew Barron, Yale University*

**Abstract:** Simplicity and likelihood are linked by principles of information theory, armed with which various likelihood penalties are developed for model selection. We provide methods for analysis of these procedures based both on properties of statistical risk and properties of data compression. It is classical that penalties based on number of parameters have these information-theoretic properties, as we review. Here we show also that penalties based on L1 norms of coefficients in regression are both risk valid and codelength valid.

(an ML talk in the AMS Speaker Series)

Tue 04/27/10, 04:30pm, Hackerman B17**Deep Learning with Multiplicative Interactions***Geoffrey Hinton, University of Toronto and Canadian Institute for Advanced Research*

**Abstract:** Deep networks can be learned efficiently from unlabeled data. The layers of representation are learned one at a time using a simple learning module that has only one layer of latent variables. The values of the latent variables of one module form the data for training the next module. Although deep networks have been quite successful for tasks such as object recognition, information retrieval, and modeling motion capture data, the simple learning modules do not have multiplicative interactions which are very useful for some types of data. The talk will show how to introduce multiplicative interactions into the basic learning module in a way that preserves the simple rules for learning and perceptual inference. The new module has a structure that is very similar to the simple cell/complex cell hierarchy that is found in visual cortex. The multiplicative interactions are useful for modeling images, image transformations and different styles of human walking. They can also be used to create generative models of spectrograms. The features learned by these generative models are excellent for phone recognition.

(an ML talk in the CLSP Speaker Series)

Wed 04/21/10, 03:45pm, Krieger 134A**How to Grow a Mind: Statistics, Structure and Abstraction***Josh Tenenbaum, MIT*

**Abstract:** How do humans come to know so much about the world from so little data? Even young children can infer the meanings of words, the hidden properties of objects, or the existence of causal relations from just one or a few relevant observations — far outstripping the capabilities of conventional learning machines. How do they do it? And how can we bring machines closer to these human-like learning abilities? I will argue that people’s everyday inductive leaps can be understood as a form of intuitive statistics — in particular, as approximations to Bayesian inference over probabilistic models of the world. These models can have rich latent structure based on abstract knowledge representations, what cognitive psychologists have sometimes called “intuitive theories”, “mental models”, or “schemas”. They also typically have a hierarchical structure supporting inference at multiple levels, or “learning to learn”, where abstract knowledge may itself be learned from experience at the same time as it guides more specific generalizations from sparse data. This talk will focus on models of learning and “learning to learn” about categories, word meanings and causal relations. I will show in each of these settings how human learners can balance the need for strongly constraining inductive biases — necessary for rapid generalization — with the flexibility to adapt to the structure of new environments, learning new inductive biases for which our minds could not have been pre-programmed. I will also discuss briefly how this approach extends to richer forms of knowledge, such as intuitive psychology and social inferences, physical reasoning, and natural number.

(an ML talk in the CogSci Speaker Series)

Tue 04/20/10, 10:45am, Hackerman B17**Building Confidence in Online Learning***Mark Dredze, Johns Hopkins University*

**Abstract:** The information revolution has produced huge quantities of digitized knowledge. Information users, such as web searchers, business analysts, and medical professionals, are overwhelmed by vast quantities of information. As new information sources move online, information overload will worsen and the need for intelligent information systems will grow. The recent focus on information processing in statistical methods has produced numerous high quality tools for processing language, including knowledge extraction, organization and analysis. With more data and better statistical methods, the state of the art advances. However, these statistical methods can have difficulty scaling up to huge quantities of diverse data. This talk will present techniques designed for processing large data collections, with a particular focus on sparse representations common to many domains with a large number of features. I will present Confidence Weighted Learning, an online (streaming) machine learning algorithm designed for these types of data distributions. Confidence weighted learning maintains a distribution over linear classifiers and updates the distribution after each example. I’ll show how this framework can be extended to multi-class and structured prediction problems, as well as extensions for modeling second-order feature interactions and noisy data.
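
As a rough illustration of what "maintaining a distribution over linear classifiers" can look like, the sketch below keeps a Gaussian with mean weights `mu` and per-feature variances `sigma` and updates both after each example. It is not the exact Confidence Weighted update from the talk: it uses the closely related AROW rule (Crammer, Kulesza, and Dredze, 2009) with a diagonal covariance, and the function names, the regularizer `r`, and the toy data are assumptions made for this example.

```python
def arow_update(mu, sigma, x, y, r=1.0):
    """One online update of a Gaussian over linear classifiers (diagonal AROW).

    mu    -- mean weight vector (list of floats)
    sigma -- per-feature variances, i.e. a diagonal covariance (list of floats)
    x     -- feature vector (list of floats)
    y     -- label, +1 or -1
    r     -- regularization constant (assumed value, not from the talk)
    """
    margin = sum(m * xi for m, xi in zip(mu, x))
    loss = max(0.0, 1.0 - y * margin)  # hinge loss of the mean classifier
    if loss == 0.0:
        return mu, sigma  # confident and correct: leave the distribution alone
    # Variance of the margin under the current distribution.
    variance = sum(s * xi * xi for s, xi in zip(sigma, x))
    beta = 1.0 / (variance + r)
    alpha = loss * beta
    # Move the mean most along features we are least certain about.
    new_mu = [m + alpha * y * s * xi for m, s, xi in zip(mu, sigma, x)]
    # Shrink the variance of every feature that appeared in this example.
    new_sigma = [s - beta * s * s * xi * xi for s, xi in zip(sigma, x)]
    return new_mu, new_sigma


def train(examples, dim, epochs=5):
    """Stream over the examples, starting from zero mean and unit variance."""
    mu, sigma = [0.0] * dim, [1.0] * dim
    for _ in range(epochs):
        for x, y in examples:
            mu, sigma = arow_update(mu, sigma, x, y)
    return mu, sigma


def predict(mu, x):
    """Classify with the mean of the distribution."""
    return 1 if sum(m * xi for m, xi in zip(mu, x)) >= 0 else -1


# Toy sparse data: feature 0 signals +1, feature 1 signals -1, feature 2 is shared.
examples = [([1, 0, 1], 1), ([0, 1, 1], -1), ([1, 0, 0], 1), ([0, 1, 0], -1)]
mu, sigma = train(examples, dim=3)
```

The per-feature variances act as confidence estimates: rarely seen features keep a large variance and therefore receive larger updates, which is the property that makes this family of methods attractive for sparse, high-dimensional data.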

**Bio:** Mark Dredze is an Assistant Research Professor in the department of Computer Science and a Senior Research Scientist at the Human Language Technology Center of Excellence at The Johns Hopkins University. His research interests include machine learning, natural language processing and intelligent user interfaces. His focus is on novel applications of machine learning to solve language processing challenges as well as applications of machine learning and natural language processing to support intelligent user interfaces for information management. He earned his PhD from the University of Pennsylvania and has worked at Google, IBM and Microsoft.

(an ML talk in the CS Speaker Series)

Mon 04/19/10, 11:00am, COE (Stieff Building), North Conference room**A Step Towards Fully Unsupervised, Life-Long, Incremental Learning***Frank Wood, Columbia University*

**Abstract:** Traditional statistical tools and methods are designed to allow inference (with confidence) about a population from a small sample. While this style of statistics will always have a place, it is now the case that one often can access so much data that parametric models themselves are sometimes not even necessary. One can simply “look at the data.” This talk will touch on a different approach to learning and inference, one in which we embrace the flood of data while still preserving a role for a model. We will advocate using models with very high complexity for their ability to support complex inference, and show a Bayesian nonparametric example of such a model for discrete data. Due to both model complexity and data volume, computational considerations arise quickly. We will show how, in one case, a careful marriage of probability and algorithmic theory gives rise to a practical, unsupervised, life-long, incremental learning algorithm that can be used in a wide variety of settings. We will demonstrate this algorithm in one such application, general purpose lossless compression.

(an ML talk in the HLTCOE Speaker Series)

Tue 03/30/10, 10:45am, Hackerman B17**People-Aware Computing: Towards Societal Scale Sensing using Mobile Phones***Tanzeem Choudhury, Dartmouth College*

**Abstract:** A great variety of sensors are being built into mobile phones today. This opens up a new research frontier, where mobile devices have the potential to significantly impact many aspects of our everyday life: from health-care, to sustainability, safety, entertainment, and business. However, the broad impact of this vision will be jeopardized without advances in the computational models, which turn raw sensor data into inferences (ranging from recognizing physical activities to tracking community-wide social interaction patterns). My group focuses on developing mobile sensing and machine learning techniques for analyzing and interpreting the behavior of individuals and social groups, including their context, activities, and social networks. Although these problems can be solved using standard machine learning techniques, how to solve them efficiently without requiring significant effort from end users is still an open problem. In this talk I will describe the research we have done to bridge this gap and advance the state of the art of people-aware computing, by developing novel learning algorithms and evaluating resulting systems through a series of real-world deployments. I will conclude by providing some specific examples of how the methods we have developed can be applied to better understand and enhance the lives of people.

**Bio:** Tanzeem Choudhury is an assistant professor in the computer science department at Dartmouth. She joined Dartmouth in 2008 after four years at Intel Research Seattle. She received her PhD from the Media Laboratory at MIT. Tanzeem develops systems that can reason about human activities, interactions, and social networks in everyday environments. Tanzeem’s doctoral thesis demonstrated for the first time the feasibility of using wearable sensors to capture and model social networks automatically, on the basis of face-to-face conversations. MIT Technology Review recognized her as one of the top 35 innovators under the age of 35 (2008 TR35) for her work in this area. Tanzeem has also been selected as a TED Fellow and is a recipient of the NSF CAREER award. More information can be found at Tanzeem’s webpage: http://www.cs.dartmouth.edu/~tanzeem.

(an ML talk in the CS Speaker Series)

Thu 03/25/10, 10:45am, Hackerman B17**Approximate Inference in Graphical Models using LP Relaxations***David Sontag, MIT*

**Abstract:** Graphical models such as Markov random fields have been successfully applied to a wide variety of fields, from computer vision and natural language processing, to computational biology. Exact probabilistic inference is generally intractable in complex models having many dependencies between the variables. In this talk, I will discuss recent work on using linear programming relaxations to perform approximate inference. By solving the LP relaxations in the dual, we obtain efficient message-passing algorithms that, when the relaxations are tight, can provably find the most likely (MAP) configuration. Our algorithms succeed at finding the MAP configuration in protein side-chain placement, protein design, and stereo vision problems. More broadly, this talk will highlight emerging connections between machine learning, polyhedral combinatorics, and combinatorial optimization.

**Bio:** David is a Ph.D. candidate in Computer Science at MIT. He received his Bachelor’s degree in Computer Science from the University of California, Berkeley in 2005. His research focuses on theory and practical algorithms for learning and probabilistic inference in large statistical models. His work received an outstanding student paper award at NIPS in 2007 and a best paper award at UAI in 2008. He currently holds the Google Fellowship in Machine Learning.

(an ML talk in the CS Speaker Series)

Wed 03/24/10, 12:00pm, Hackerman B17**Apprenticeship Learning for Robotic Control with Application to Quadruped Locomotion and Autonomous Helicopter Flight***Pieter Abbeel, University of California, Berkeley*

**Abstract:** Many problems in robotics have unknown, stochastic, high-dimensional, and highly non-linear dynamics, and offer significant challenges to classical control methods. Some of the key difficulties in these problems are that (i) it is often hard to write down, in closed form, a formal specification of the control task (for example, what is the objective function for “flying well”?), (ii) it is difficult to build a good dynamics model because of both data collection and data modeling challenges (similar to the “exploration problem” in reinforcement learning), and (iii) it is expensive to find closed-loop controllers for high dimensional, stochastic domains. In this talk, I will present learning algorithms which show that these problems can be efficiently addressed in the apprenticeship learning setting, in which expert demonstrations of the task are available. I will also present how our apprenticeship learning techniques have enabled us to solve real-world control problems that could not be solved before: they have enabled a quadruped robot to traverse challenging terrain, and a helicopter to perform by far the most challenging aerobatic maneuvers performed by any autonomous helicopter to date, including maneuvers such as chaos and tic-tocs, which only exceptional expert human pilots can fly.

**Bio:** Pieter Abbeel received his Ph.D. degree in Computer Science from Stanford University in 2008. He is currently an assistant professor at UC Berkeley’s EECS department. His research focuses on robotics, machine learning and control. For more information, see www.cs.berkeley.edu/~pabbeel.

(an ML talk in the LCSR Speaker Series)

Tue 03/23/10, 10:45am, Hackerman B17**Understanding the Genetic Basis of Complex Diseases via Genome-Phenome Association***Seyoung Kim, Carnegie Mellon University*

**Abstract:** Genome-wide association studies have recently become popular as a tool for identifying the genetic loci that are responsible for increased disease susceptibility by examining genetic and phenotypic variation across a large number of individuals. The cause of many complex disease syndromes involves the complex interplay of a large number of genomic variations that perturb disease-related genes in the context of a regulatory network. As patient cohorts are routinely surveyed for a large number of traits such as hundreds of clinical phenotypes and genome-wide profiling for thousands of gene expressions, this raises new computational challenges in identifying genetic variations associated simultaneously with multiple correlated traits. In this talk, I will present algorithms that go beyond the traditional approach of examining the correlation between a single genetic marker and a single trait. Our algorithms build on a sparse regression method in statistics, and are able to discover genetic variants that perturb modules of correlated molecular and clinical phenotypes during genome-phenome association mapping. Our approach is significantly better at detecting associations when genetic markers synergistically influence a group of traits.

**Bio:** Seyoung Kim is currently a project scientist in the Machine Learning Department at Carnegie Mellon University. Her work as a postdoctoral fellow and project scientist at Carnegie Mellon University has included developing machine-learning algorithms for disease association mapping. She received her Ph.D. in computer science from the University of California, Irvine, in 2007. During her Ph.D., she worked on statistical machine learning methods for problems in the biomedical domain.

(an ML talk in the CS Speaker Series)

Thu 03/11/10, 10:45am, Hackerman B17**Hierarchical Bayesian Methods for Reinforcement Learning***David Wingate, MIT*

**Abstract:** Designing autonomous agents capable of coping with the complexity of the real world is a tremendous engineering challenge. Such agents must often deal with rich observations (such as images), unknown dynamics, and rich structure—perhaps consisting of objects, their properties/types and their dynamical interactions. An ability to learn from experience and generalize radically to new situations is essential; at the same time, the agent may bring substantial prior knowledge to bear on the environment it finds itself in. In this talk, I will present recent work on the combination of reinforcement learning and nonparametric Bayesian modeling. Hierarchical Bayes provides a principled framework for incorporating prior knowledge and dealing explicitly with uncertainty, while reinforcement learning provides a framework for making sequential decisions under uncertainty. I will discuss how nonparametric Bayesian models can help answer two questions: 1) how can an agent learn a representation of state space in a structured domain? and 2) how can an agent learn how to search for good control laws in hard-to-search spaces? I will illustrate the concepts on applications including modeling neural spike train data, causal sound source separation and optimal control in high-dimensional, simulated robotic environments.

**Bio:** David Wingate received a B.S. and M.S. in Computer Science from Brigham Young University in 2002 and 2004, and a Ph.D. in Computer Science from the University of Michigan in 2008. He is currently a postdoctoral research associate in the Computational Cognitive Science group at MIT. David’s research interests lie at the intersection of perception, control and cognition. His research spans diverse topics in reinforcement learning, Bayesian unsupervised learning, information theory, manifold learning, kernel methods, massively parallel processing, visual perception, and optimal control.

(an ML talk in the CS Speaker Series)

Mon 03/08/10, 11:00am, COE (Stieff Building), North Conference room**Doing More with Less…Labeled Data: New Directions in Semi-Supervised Learning***Andrew Goldberg, University of Wisconsin-Madison*

**Abstract:** This talk describes several recent advances in semi-supervised learning (SSL), the machine learning paradigm that tries to exploit unlabeled data to build better classifiers and predictors. By reducing the amount of expensive and time-consuming labeled data required, SSL methods are of great practical value, especially in text, speech, image, and Web applications. After explaining why and how we can exploit unlabeled data, I will present three novel contributions to graph-based semi-supervised learning drawn from my dissertation research. One piece of work introduces online (i.e., incremental) semi-supervised learning, which is especially appropriate for large-scale learning problems that evolve over time. The second contribution is a method for building graphs over complex data containing multiple intersecting manifolds varying in dimension, orientation, or density; such a graph enables improved clustering and label diffusion for semi-supervised prediction problems. The third contribution makes it possible to incorporate dissimilarity information into binary and multi-class graph-based semi-supervised classification. After discussing these topics in detail, I will touch on other non-dissertation work spanning topics such as query intent classification, diversity-based ranking, pictorial communication, sentiment analysis, and wish detection. Finally, the talk concludes with a discussion of current and future research interests in semi-supervised learning and beyond.

**Bio:** Andrew B. Goldberg is a Ph.D. candidate in the Computer Sciences department at the University of Wisconsin-Madison. His research interests lie in statistical machine learning, semi-supervised learning, natural language processing, information retrieval, and Web and text mining. He has published 11 peer-reviewed journal and conference papers, co-authored two books, and filed one patent. During graduate school, Andrew has held internships at Google Research and Microsoft Research. He has also served as a reviewer on the program committee for national and international conferences including ICML, AAAI, ACL, EMNLP, and NAACL-HLT. Prior to beginning his graduate studies in 2005, Andrew graduated magna cum laude from Amherst College in 2003 and then spent two years developing introductory computer science and Web programming textbooks at Deitel and Associates in the Boston area. In his free time, Andrew enjoys cooking, digital photography, and travel.

(an ML talk in the HLTCOE Speaker Series)

Thu 03/04/10, 04:00pm, Whitehead 304**Longitudinal Functional Principal Component Analysis***Ciprian M. Crainiceanu, Johns Hopkins University SPH*

**Abstract:** We introduce models for the analysis of functional data observed at multiple time points. The dynamic behavior of functional data is decomposed into a time-dependent population average, baseline (or static) subject-specific variability, longitudinal (or dynamic) subject-specific variability, subject-visit-specific variability and measurement error. The model can be viewed as the functional analog of the classical mixed effects model where random effects are replaced by random processes. Methods have wide applicability and are computationally feasible for moderate and large data sets. Computational feasibility is assured by using principal component bases for the functional processes. The methodology is motivated by and applied to a diffusion tensor imaging (DTI) study designed to analyze differences and changes in brain connectivity in healthy volunteers and multiple sclerosis (MS) patients.

(an ML talk in the AMS Speaker Series)

Tue 03/02/10, 01:00pm, Clark 314**Image Modeling and Enhancement with Structured Sparse Model Selection***Guoshen Yu, University of Minnesota*

**Abstract:** An image representation framework based on structured sparse model selection is introduced in this work. The corresponding modeling dictionary is comprised of a family of learned orthogonal bases. For an image patch, a model is first selected from this dictionary through linear approximation in a best basis, and the signal estimation is then calculated with the selected model. The model selection leads to a guaranteed near optimal denoising estimator. The degree of freedom in the model selection is equal to the number of the bases, typically about 10 for natural images, and is significantly lower than with traditional overcomplete dictionary approaches, stabilizing the representation. For an image patch of size √N × √N, the computational complexity of the proposed framework is O(N^2), typically 2 to 3 orders of magnitude faster than estimation in an overcomplete dictionary. The orthogonal bases are adapted to the image of interest and are computed with a simple and fast procedure. State-of-the-art results are shown in image denoising, deblurring, and inpainting.

**Bio:** Guoshen Yu received the B.Sc. degree in electronic engineering from Fudan University, Shanghai, China, in 2003, the engineering degree from Telecom ParisTech, Paris, France, in 2006, the M.Sc. degree in applied mathematics from ENS de Cachan, Cachan, France, in 2006, and the Ph.D. degree in applied mathematics from Ecole Polytechnique, Palaiseau, France, in 2009. He is now a Postdoctoral Research Associate at the Electrical and Computer Engineering Department, University of Minnesota, Twin Cities. He was a research intern with STMicroelectronics, Agrate, Italy, and with Let It Wave, Paris, France, for one year in 2005–2006. In the spring 2008 semester, he was a visiting graduate student in the Mechanical Engineering Department, Massachusetts Institute of Technology (MIT), Cambridge. His research interests include signal, image, and video processing and computer vision.

(an ML talk in the CIS Speaker Series)

Mon 03/01/10, 11:00am, COE (Stieff Hall), North Conference room**New Learning Frameworks for Information Retrieval***Yisong Yue, Cornell University*

**Abstract:** Information retrieval has become a central technology in managing and leveraging the ongoing explosion of digital content. Current techniques for designing retrieval models are limited by two issues. First, they have restricted representational power, and generally deal with simple settings that estimate the quality of individual results independently of other results. Second, existing methodologies for designing retrieval functions are labor intensive and cannot be efficiently applied to accommodate a growing variety of retrieval domains. In this talk, I will describe two learning approaches for designing new retrieval models. The first is a structured prediction approach, which considers inter-dependencies between results in order to optimize for more sophisticated objectives such as information diversity. The second is an interactive learning approach, which reduces the efficiency bottleneck of relying on human experts by leveraging data gathered from online user interactions; such data is both cheap to collect and naturally representative of user utilities in the target domain. This is joint work with Thorsten Joachims.

**Bio:** Yisong Yue is a Ph.D. candidate at Cornell University, where he works on machine learning approaches to structured prediction and interactive systems, with an application focus in information retrieval. He is the author of the SVM-map software package for optimizing mean average precision using support vector machines, and he currently manages the experimental search service for the Physics E-Print ArXiv. He is also the recipient of a Microsoft Research Graduate Fellowship and a Yahoo! Key Scientific Challenges Award. His recent research focuses on machine learning approaches to learning from user interactions (implicit feedback), online experiment design, diversified retrieval, and interactive search.

(an ML talk in the HLTCOE Speaker Series)

Wed 02/24/10, 12:00pm, Hackerman B17**Learning a Hierarchical Compositional Shape Vocabulary for Multi-class Object Representation***Ales Leonardis, University of Ljubljana*

**Abstract:** Hierarchies allow sharing of features between the visually similar as well as dissimilar classes at multiple levels of specificity. This makes them potentially suitable for learning and recognizing a higher number of object classes. However, the success of the hierarchical approaches so far has been hindered by the use of hand-crafted features or predetermined grouping rules. In this talk, I will present a framework for learning a hierarchical compositional shape vocabulary for representing multiple object classes. The approach takes simple contour fragments and learns their frequent spatial configurations. These are recursively combined into increasingly more complex and class-specific shape compositions, each exerting a high degree of shape variability. The top-level vocabulary compositions code the whole shapes of the objects. The vocabulary is learned sequentially, layer after layer, statistically adjusting to the visual data. The lower layers are learned jointly on images of all classes, whereas the higher layers of the vocabulary are learned incrementally, by presenting the algorithm with one object class after another. Learning of the classes is supervised, in that we assume the positive and validation sets of class images are given; however, we learn the hierarchical structure of each class in a completely unsupervised way. The experimental results show that the learned multi-class representation scales logarithmically with the number of classes and achieves state-of-the-art detection performance with both faster inference and faster training times. This is joint work with Sanja Fidler and Marko Boben.

(an ML talk in the LCSR Speaker Series)

Tue 02/23/10, 10:45am, Hackerman B17**Nonparametric Learning in High Dimensions***Han Liu, Carnegie Mellon University*

**Abstract:** Despite the high dimensionality and complexity of many modern datasets, some problems have hidden structure that makes efficient statistical inference feasible. Examples of these hidden structures include: additivity, sparsity, low-dimensional manifold structure, smoothness, copula structure, and conditional independence relations. In this talk, I will describe efficient nonparametric learning algorithms that exploit such hidden structures to overcome the curse of dimensionality. These algorithms have strong theoretical guarantees and provide practical methods for many fundamentally important learning problems, ranging from unsupervised exploratory data analysis to supervised predictive modeling. I will use two examples of high dimensional graph estimation and multi-task regression to illustrate the principles of developing high dimensional nonparametric methods. The theoretical results are presented in terms of risk consistency, estimation consistency, and model selection consistency. The practical performance of the algorithms is illustrated on genomics and cognitive neuroscience examples and compared to state-of-the-art parametric competitors. This work is joint with John Lafferty and Larry Wasserman.

**Bio:** Han Liu is a fifth-year PhD student in the Machine Learning Department within the School of Computer Science at Carnegie Mellon University. He is in the Joint PhD program in Machine Learning and Statistics. His dissertation, directed by John Lafferty and Larry Wasserman, is entitled, “High Dimensional Nonparametric Learning and Massive-Data Analysis”. This study investigates fundamental theory and methods for high dimensional nonparametric inference and demonstrates their applicability to areas such as computational biology and cognitive neuroscience. Over the past two years, Han Liu has won the Google Ph.D. fellowship award in Statistics, the best student paper award at the 26th International Conference on Machine Learning (ICML 2009), and the 2010 best paper award in the ASA (American Statistical Association) student paper competition in Statistical Computing and Graphics. Han Liu obtained his MSc in Statistics and Machine Learning at Carnegie Mellon University in 2007 and another MSc in Computer Science at the University of Toronto in 2005.

(an ML talk in the CS Speaker Series)

Thu 02/04/10, 04:00pm, Whitehead 304**Ranking and Selection of Many Alternatives using Correlated Knowledge Gradients***Peter Frazier, Cornell University*

**Abstract:** We consider the ranking and selection problem, in which one wishes to use a fixed sampling budget efficiently to find which of several alternatives is the best. By explicitly modeling the relationship between alternative values with a correlated Bayesian prior, sampling policies may perform efficiently even when the number of alternatives is very large. We propose the correlated knowledge-gradient sampling policy and show special cases in which it is Bayes-optimal. We apply it to two problems: drug discovery, and continuous global optimization.

(an ML talk in the AMS Speaker Series)

Thu 12/03/09, 10:45am, Hackerman B17**Tele-Immersion, The Cyber-Infrastructure for Studying Body Language***Ruzena Bajcsy, University of California, Berkeley*

**Abstract:** The tele-immersion system at CITRIS lab at UC Berkeley consists of 48 cameras in 12 stereo clusters. Images are processed by 12 computers running simultaneously and sending data via Internet II connection to near and remote rendering computers. This system makes it possible to generate 4D data of moving people in real time (22 frames/second). In this presentation I will show the following applications: two dancers, each in a separate location, dancing together in the virtual space; two geoscientists in different locations interactively discussing, manipulating, and analyzing seismographic data; and an archaeologist analyzing 3D data from China virtually. The scientific agenda stemming from this infrastructure is the dynamic analysis of human body action and its interpretation. We will show some preliminary results from segmentation and classification of human action. This work is a collaboration between UC Berkeley, UIUC, UC Merced and UC Davis.

**Bio:** Dr. Ruzena Bajcsy was appointed Director of CITRIS and professor in the EECS department at the University of California, Berkeley on November 1, 2001. In 2004 she became a CITRIS director emeritus and she is now a full-time professor of EECS. Dr. Bajcsy is a pioneering researcher in machine perception, robotics and artificial intelligence. Dr. Bajcsy received her master’s and Ph.D. degrees in electrical engineering from Slovak Technical University in 1957 and 1967, respectively. She received a Ph.D. in computer science in 1972 from Stanford University.

(an ML talk in the CS Speaker Series)

Tue 12/01/09, 01:00pm, Clark 314**Solving Image Matching Problems Using Interior Point Methods***Camillo J. Taylor, University of Pennsylvania*

**Abstract:** The problem of finding correspondences between two images is central to many applications in computer vision. This talk will describe a new approach to tackling a variety of image matching problems using Linear Programming. The approach proceeds by constructing a piecewise-linear, convex approximation to the match score function associated with each of the pixels being matched. Regularization terms related to the first and second derivatives of the displacement function can also be modeled in this framework as convex functions. Once this has been done, the global image matching problem can be reformulated as a large-scale Linear Program, which can be solved using Interior Point methods. The resulting optimization problems are highly structured, and efficient algorithms that exploit these regularities will be presented. The talk will describe applications of this approach to stereo matching and to recovering parametric image deformations that optimally register two frames.

**Bio:** Dr. Taylor received his A.B. degree in Electrical, Computer and Systems Engineering from Harvard College in 1988 and his M.S. and Ph.D. degrees from Yale University in 1990 and 1994, respectively. Dr. Taylor was the Jamaica Scholar in 1984, was a member of the Harvard chapter of Phi Beta Kappa, and held a Harvard College Scholarship from 1986 to 1988. From 1994 to 1997 Dr. Taylor was a postdoctoral researcher and lecturer with the Department of Electrical Engineering and Computer Science at the University of California, Berkeley. He joined the faculty of the Computer and Information Science Department at the University of Pennsylvania in September 1997. He received an NSF CAREER award in 1998 and the Lindback Minority Junior Faculty Award in 2001. Dr. Taylor’s research interests lie primarily in the fields of Computer Vision and Robotics and include reconstruction of 3D models from images, vision-guided robot navigation, and smart camera networks. Dr. Taylor has served as an Associate Editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence. He has also served on numerous conference organizing committees and was a Program Chair of the 2006 edition of the IEEE Conference on Computer Vision and Pattern Recognition.

(an ML talk in the CIS Speaker Series)

Tue 11/24/09, 01:00pm, Clark Hall 110**Deformable Models and Tensor Algebraic Methods for Imaging Science***Demetri Terzopoulos, University of California, Los Angeles*

**Abstract:** I will discuss appearance-based and model-based methods for image analysis and synthesis. Natural images result from the multifactor interaction between the illumination, scene geometry, and image acquisition process. Numerical multilinear (tensor) algebra provides a principled mathematical framework for multifactor appearance-based image synthesis and recognition through multilinear generalizations of principal components analysis (PCA) and independent components analysis (ICA). Next, I will describe a powerful paradigm known as deformable models, which combines geometry, physics, and estimation theory. Deformable models evolve according to the continuum mechanical principles of flexible materials, expressed via variational principles and PDEs. My focus will be on applications in medical image analysis.

**Bio:** Demetri Terzopoulos is the Chancellor’s Professor of Computer Science at the University of California, Los Angeles. He graduated from McGill University and obtained his PhD degree from MIT in 1984. He is a Guggenheim Fellow, a Fellow of the ACM, a Fellow of the IEEE, a Fellow of the Royal Society of Canada, and a member of the European Academy of Sciences. His many awards and honors include the inaugural Computer Vision Significant Researcher Award from the IEEE for his pioneering and sustained research on deformable models and their applications, and a Technical Oscar from the Academy of Motion Picture Arts and Sciences for his pioneering work on physics-based computer animation. He is listed by ISI and other indexes as one of the most highly cited authors in engineering and computer science, with more than 300 published research papers and several volumes, primarily in computer graphics, computer vision, medical imaging, computer-aided design, and artificial intelligence/life.

(an ML talk in the CIS Speaker Series)

Tue 11/17/09, 04:30pm, Hackerman B17**Graph Identification***Lise Getoor, University of Maryland*

**Abstract:** Within the machine learning and data mining communities, there has been a growing interest in learning structured models from input data that is itself structured or semi-structured. Graph identification refers to methods that transform observational data described as a noisy input graph into an inferred, “clean” information graph. Examples include inferring social networks from noisy online communication data, identifying gene regulatory networks from protein-protein interactions, and extracting semantic graphs from noisy and ambiguous co-occurrence information. Some of the key processes in graph identification are entity resolution, collective classification, and link prediction. I will give an overview of algorithms for these tasks and discuss the need for integrating the methods to solve the overall problem jointly. Time permitting, I will also give quick overviews of some of the other research projects in my group.

**Bio:** Lise Getoor is an associate professor in the Computer Science Department at the University of Maryland, College Park. She received her PhD from Stanford University in 2001. Her current work includes research on link mining, statistical relational learning and representing uncertainty in structured and semi-structured data. She has also done work on social network analysis and visual analytics. She has published numerous articles in machine learning, data mining, database, and artificial intelligence forums. She was awarded an NSF Career Award, is an action editor for the Machine Learning Journal, is a JAIR associate editor, has been a member of the AAAI Executive Council, and has served on a variety of program committees including AAAI, ICML, IJCAI, ISWC, KDD, SIGMOD, UAI, VLDB, and WWW. See http://www.cs.umd.edu/~getoor for more information.

(an ML talk in the CLSP Speaker Series)

Tue 11/17/09, 01:00pm, Clark 314**Clustering, Gaussian Mixture Models, and Sparse Eigenfunction Bases for Semi-Supervised Learning***Mikhail Belkin, Ohio State University*

**Abstract:** In this talk I will discuss some recent work on using eigenvectors of certain kernel matrices for clustering, learning Gaussian mixture models, and semi-supervised learning. It turns out that certain eigenvectors of kernel matrices correspond to high-density clusters in the data and can be used to obtain the cluster structure of the data and to estimate the number of clusters. When the data comes from a Gaussian mixture distribution, these eigenvectors and the corresponding eigenvalues can be used to directly estimate the parameters of a Gaussian mixture, a model used in a wide variety of applications, including vision and speech. The resulting algorithm shows good results compared to the standard EM-based procedures. Finally, I will discuss how the cluster assumption in semi-supervised learning can be recast in terms of the classifier having a sparse representation in a data-dependent basis constructed from unlabeled data. Surprisingly, a standard method for sparse learning failed to take advantage of this sparsity and did not benefit from the unlabeled data, while the direct basis sub-selection algorithm showed performance comparable to state-of-the-art semi-supervised learners. Parts of this talk are joint work with Tao Shi, Kaushik Sinha and Bin Yu.

**Bio:** Mikhail Belkin is an Assistant Professor in the Computer Science and Statistics departments at the Ohio State University. He received his PhD from the Mathematics department at the University of Chicago in 2003. He was a co-organizer of the Chicago Machine Learning Summer School in 2005, the Workshop on Geometry, Random Matrices, and Statistical Inference at SAMSI in 2007, and the 2009 Machine Learning Summer School/Workshop on Theory and Practice of Computational Learning. His primary research interests are artificial intelligence and statistical pattern recognition. His recent research focuses on designing and analyzing algorithms for machine learning based on the non-linear structure of multi-dimensional data, such as manifold and spectral methods. Mikhail Belkin received the National Science Foundation (NSF) Career Award in 2006. His research is currently supported by the National Science Foundation and the U.S. Air Force Research Foundation.

(an ML talk in the CIS Speaker Series)

Tue 11/03/09, 01:00pm, Clark 314**Learning from Data Using Matchings and Graphs***Tony Jebara, Columbia University*

**Abstract:** Many machine learning problems on data can naturally be formulated as problems on graphs. For example, dimensionality reduction and visualization are related to graph embedding. Given a sparse graph between N high-dimensional data nodes, how do we faithfully embed it in low dimension? We present an algorithm that improves dimensionality reduction by extending the Maximum Variance Unfolding method. But, given only a dataset of N samples, how do we construct a graph in the first place? The space to explore is daunting, with 2^(N(N-1)/2) graphs to choose from, yet two interesting subfamilies are tractable: matchings and b-matchings. By placing distributions over matchings and using loopy belief propagation, we can efficiently infer maximum weight subgraphs. These fast generalized matching algorithms leverage integral LP relaxation and perfect graph theory. Applications include graph reconstruction, graph embedding, graph transduction, and graph partitioning with emphasis on data from text, network and image domains. Time permitting, I will also present applications to large scale mobile telecommunication data.

**Bio:** Tony Jebara is Associate Professor of Computer Science at Columbia University and co-founder of Sense Networks. He directs the Columbia Machine Learning Laboratory whose research intersects computer science and statistics to develop new frameworks for learning from data with applications in vision, networks, spatio-temporal data, and text. He obtained his PhD in 2002 from MIT. Recently, Esquire magazine named him one of their Best and Brightest of 2008.

(an ML talk in the CIS Speaker Series)