Johns Hopkins is consistently ranked among the top universities worldwide, and many of its schools and departments are ranked best in the world. We host some of the best and most interesting scientific data, and collaboration is one of JHU’s strong suits.
Our excellent cross-departmental community of machine learning faculty works on both general-purpose ML and more domain-specific ML methods. Here we survey some of our machine learning focus areas (explore individual faculty pages for more details).
General Purpose ML Methods
Several of us, especially in Computer Science and Applied Math and Statistics, focus on developing fundamental ML methods that cut across applications. Some examples:
- Bayesian methods including MCMC for hierarchical models
- Dimensionality reduction, including manifold learning, generalized PCA, etc.
- Statistics of graphs, with applications including social networks and brain-graphs
- Graphical models and causal inference, with applications including public health and policy
- Parallel computation for “big data” problems such as teravoxel image stitching, information retrieval, and optimization
- Robust statistics for biological networks and data mining
- Learning policies and parameters for fast, accurate approximate inference
- Novel nonparametric models, both frequentist and Bayesian
- Information-theoretic approaches to estimating discrete conditional distributions
- Semi-supervised learning for graphs and natural language
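As a concrete illustration of the first item above, here is a minimal sketch of MCMC: a random-walk Metropolis sampler. It is only an illustrative example, not any group's actual method. The names `log_target` and `metropolis`, the standard-normal target, and the step size are all assumptions chosen for brevity; a hierarchical model would substitute its own log-posterior.

```python
import math
import random


def log_target(x):
    """Unnormalized log-density of a standard normal (illustrative target only)."""
    return -0.5 * x * x


def metropolis(log_density, n_samples, step=1.0, x0=0.0, seed=0):
    """Random-walk Metropolis: propose x + noise, accept with prob min(1, p(new)/p(old))."""
    rng = random.Random(seed)
    x = x0
    logp = log_density(x)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        logp_new = log_density(proposal)
        # Acceptance probability, computed from the log-density difference.
        accept_prob = math.exp(min(0.0, logp_new - logp))
        if rng.random() < accept_prob:
            x, logp = proposal, logp_new
        # A rejected proposal repeats the current state in the chain.
        samples.append(x)
    return samples


samples = metropolis(log_target, n_samples=20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"sample mean {mean:.3f}, sample variance {var:.3f}")
```

Each step evaluates the log-density once. When that density is a posterior over a large data set, every step touches all of the data, which is one reason such samplers scale poorly.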
Speech and Natural Language
Humans produce vast quantities of natural language—the primary medium of public knowledge, private records, and interpersonal communication. We are attacking the many statistical modeling and machine learning challenges needed to interact meaningfully with such data.
- Structured prediction: At the level of individual sentences, we develop new algorithms to efficiently infer complex and unbounded discrete variables: parse trees, speech transcriptions, text translations, and meaning representations.
- Large-scale inference: From large corpora, we extract factual knowledge and patterns of human communicative behavior. We are also reconstructing the grammatical and lexical structure of multiple languages at once. These difficult global inference problems require informed models and new semi-supervised and unsupervised techniques, including nonparametric Bayesian methods.
- Signal processing: On auditory data, we focus on learning intermediate representations that allow us to extract acoustic events such as phonemes from the auditory stream. We draw on manifold learning, information geometry, physical modeling, and the neuroscience of perception.
- Data scarcity: Humans talk about many things, in many languages and dialects and styles. Faced with a practical task—one particular type of language data and something to predict from it—one rarely has enough representative data to train millions of parameters. We seek general solutions to data scarcity, involving domain adaptation, multi-task learning, active learning, and crowdsourcing.
JHU is one of the top few universities worldwide in language and speech processing, and is well known for its focus on core statistical and algorithmic methodology. The Center for Language and Speech Processing serves to coordinate related research, education, and outreach across several departments plus the Human Language Technology Center of Excellence.
Computational Biology
The pace of biological discovery has accelerated thanks to “next-generation” technologies, whose rate of data acquisition is increasing even faster than Moore’s law. Terabyte and petabyte datasets generated at JHU are leading to a new set of problems requiring machine learning solutions.
- Scalable learning: As data sets increase in size, scalable learning becomes increasingly important. Algorithms such as Markov chain Monte Carlo, which may perform well for small data sets, become intractably slow for realistic problems. For example, at JHU, we are attempting to analyze whole-genome data to reveal how stem cells differentiate and specialize and how new tissues are formed — bridging spatial scales from nucleotides to genes to cells to tissues to organisms.
- Small-sample learning: Many biological data sets measure a very large number of features on a much smaller set of biological samples. Standard statistical techniques fail in this regime. A full human genome sequence, for example, contains in principle over 3 billion features accounting for genetic variation at each nucleotide, yet many studies enroll only hundreds to thousands of individuals. Working with scientists at the School of Medicine, we are discovering how rare mutations lead to cancer predisposition, heart disease, neuropsychiatric disorders, and other medical conditions.
- Heterogeneous learning: Many classic inference problems are framed in terms of a data matrix and, for supervised problems, a vector to predict. Biological data sets often involve multiple types of data and require fundamentally different statistical models. Machine learning methods developed for homogeneous data are typically hard to extend to this case. At JHU, we are combining imaging with personal genetics, and combining data generated from metabolite profiling, mRNA sequencing, DNA sequencing, protein assays, and even natural language processing of the scientific literature to create a moving picture of the cell.
- Network science: Biological data sets often have natural representations as networks, with genes and proteins as vertices and specific interaction types as edges. While many machine learning methods have been developed for graph analysis, biological networks pose special challenges. Edges can be noisy, with varying experimental quality or biochemical strength, and can be time-dependent or condition-specific. Furthermore, while most social network data sets consider only a single interaction modality such as phone calls or emails, biological data analysis often requires integration of multi-modal or multi-scale data.
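To make the small-sample problem above concrete, the following sketch shows one way standard techniques can mislead when features vastly outnumber samples: among thousands of pure-noise features, some correlate strongly with the outcome by chance. The function names, the sample sizes, and the use of Gaussian noise are illustrative assumptions, not a description of any JHU study.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def max_spurious_correlation(n_samples=20, n_features=5000, seed=0):
    """Largest |correlation| between a random outcome and independent noise features."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        # Each feature is generated independently of the outcome.
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, outcome)))
    return best


# Few samples, many features: a strong "signal" appears by chance alone.
print(max_spurious_correlation(n_samples=20, n_features=5000))
# Many samples, few features: the chance correlations stay small.
print(max_spurious_correlation(n_samples=2000, n_features=200))
```

No feature here carries real information, so any selection procedure that picks the best-correlated feature in the first setting is fitting noise. This is why the regime calls for regularization, multiple-testing control, or other methods designed for it.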
At the Homewood campus, the Department of Biomedical Engineering and the Institute for Computational Medicine have core faculty devoted to computational biology. At the medical school campus, faculty pursue machine learning in the High-Throughput Biology Center, the Institute of Genetic Medicine, the Department of Biostatistics, and the Center for Computational Genomics.
Computer Vision
Humans effortlessly distinguish among a remarkable variety of objects, actions and interactions in complex, cluttered scenes. They can recognize subtle behaviors among several people interacting with each other and with everyday objects. In contrast, automatically interpreting such scenes has proved surprisingly resistant to decades of research.
Automatic interpretation of images and videos raises serious challenges for statistical inference and learning. The difficulties are especially pronounced when the objective is to uncover a complex statistical dependency structure within a large number of variables and yet the number of samples available for learning is relatively limited. At JHU, we are developing algorithms, often based on graphical models, for
- interpreting images efficiently
- reducing the dimensionality of high-dimensional datasets
- learning from very few examples
- clustering data living in multiple subspaces and manifolds
- discovering relationships in high-dimensional datasets
We apply these techniques to key challenges in computer vision, such as
- detecting faces and deformable objects (e.g., cats) in photographs
- recognizing object categories in photographs
- segmenting and tracking moving objects in video
- recognizing dynamic textures (water, smoke, fire) in video
- recognizing human activities in video
- modeling and recognizing skill in surgical motion and video data
JHU is a major player in computer vision, focusing on foundational research in the field. The Center for Imaging Science serves to coordinate related research, education, and outreach across several JHU departments. The Vision Sciences Group at the Homewood campus brings together the study of machine vision and biological vision.
Robotics
Robots must act in the physical world to accomplish a specific task or objective. Typically, their actions are governed by information acquired from multiple sources and at various time scales. How does the robot decide what data to acquire, how to relate it to the task at hand, and what actions to take? Some examples of machine learning in JHU’s Sensor-Based Robotics research:
- Modeling and mining human action: We can now record humans performing tasks unobtrusively and at scale. For example, we can record any of the 270,000 surgeries performed annually with our da Vinci surgical robot. We use ML techniques to process these complex motion signals into quasi-grammatical structures that can then be exploited to augment or automate tasks.
- Data to information: Video cameras now produce far more data than any human can realistically process. For example, the so-called Pill Cam, which is swallowed by a patient, produces 50,000 images over 8 hours which must be reviewed diagnostically. We develop ML techniques to automate such assessments of images.
- Learning and control: As devices become more complex and diverse, it becomes impractical to specify the control policies for all necessary tasks. Machine learning is used to learn control from examples, or to achieve a particular objective.
JHU’s world-class robotics group, the Laboratory for Computational Sensing and Robotics, has particular strength in medical applications.
Healthcare
Modern healthcare is being transformed by new and growing electronic resources, with hospitals generating terabytes of imaging, diagnostic, monitoring, and treatment data. Machine learning is central to utilizing these rapidly expanding datasets, combing through data across patients, clinics, and hospitals to uncover more effective treatments and practices that increase the quality and longevity of human life. At Johns Hopkins, machine learning researchers are partnering with health care professionals to tackle high-impact questions such as:
- Are there sub-populations that respond more effectively to a treatment?
- What observations by physicians can serve as early warning signs for the onset of chronic illness?
- Can we detect emerging medical epidemics from social networks?
- Can we build personalized real-time data-driven care managers to manage chronic conditions?
- Can we aid medical decision making in high-risk breast, ovarian, prostate, and colorectal cancer families based on computational analyses of protein evolution and structure?
- How can randomized clinical trials be optimized for better estimates of treatment effects?
- What patterns can we discover from large-scale DCE-MRI and fMRI data?
These questions require novel machine learning algorithms tailored to large-scale medical data. We seek to build accurate and reliable algorithms on which doctors can base decisions, from the level of individual patients to widespread public policy.
The new Center for Personalized Cancer Medicine and Center for Population Health Information Technology (CPHIT) both aim to systematically improve patient care through learning algorithms that run on massive datasets of genetic data or electronic medical records.
Neuroscience
Neuroscientific paradigms at JHU range widely: e.g., molecular genetics, in vivo calcium imaging, multi-electrode array recording, and magnetic resonance imaging (MRI). The resulting complex datasets often have millions of dimensions and may help to unravel many longstanding scientific questions. Some of our machine learning goals:
- Explanatory modeling: A fundamental goal is to learn explanatory, causal, high-dimensional graphical models of whole-brain response over time. That is, what is the joint distribution over brain activity (spike trains, fMRI, or EEG) and its external visual, auditory, tactile, and motor correlates?
- Prediction from high-dimensional data: We are diagnosing psychiatric conditions (e.g., ADHD) from whole-brain activation patterns. Separately, we are combining machine learning with differential geometry to diagnose neurodegenerative diseases from anatomical images.
- Random graph models: JHU hosts many connectome datasets, including the world’s largest: 10 TB of electron microscopy on brain slices. We are inferring spatial connectivity graphs from the raw images and building stochastic models of the graph structure.
No single center coordinates neuroscience research at JHU, although the Brain Science Institute helps to do so. The Department of Neuroscience is home to several of the world’s most highly cited neuroscientists. Investigators from the Krieger Mind/Brain Institute, Biomedical Engineering, Biostatistics, and Psychological and Brain Sciences are developing and testing theories of neural coding, attention, object recognition, and motor control. Neuroimaging research at the Center for Imaging Science uses state-of-the-art data collected at the F. M. Kirby Center for Functional Brain Imaging and the world’s top Neurology and Neurosurgery department.
Network Science
Historically, machine learning has focused on data points that are assumed to be sampled independently or exchangeably from some distribution. These assumptions are inappropriate for network data such as social networks, transportation networks, and biological neural networks. A number of basic network-based questions are actively being investigated by members of the ML@JHU community, including:
- Graph invariants: What graph invariants are most powerful for various hypothesis testing and anomaly detection tasks? Can observed vertex and edge attributes lead to more powerful tests by considering graph structure jointly with graph content?
- Graph embeddings: Graphs are inherently high-dimensional non-Euclidean objects. How can we embed graphs in low-dimensional Euclidean (or near-Euclidean) spaces for pattern recognition and visualization? How can we choose the optimal dimension for a specific task?
- Semi-supervised classification using graph structure: When attributes of some nodes or edges are observed, can we predict attributes of other nodes or edges in the same network? For example, given a few examples of fraudulent actors in a social network, communication network, or financial network, can we find more like them? Given a noisy data set from high-throughput biological assays, can we discard spurious edges and nominate missing edges for testing?
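The semi-supervised question above can be illustrated with a minimal label-propagation sketch: a few nodes carry known labels, and every other node repeatedly takes the average score of its neighbors. This is one simple, well-known technique chosen for illustration, not necessarily what ML@JHU researchers use. The function name `propagate_labels`, the toy graph, and the 0-to-1 scoring are assumptions.

```python
def propagate_labels(neighbors, seeds, n_iters=200):
    """Spread seed scores over a graph by repeated neighbor averaging.

    neighbors: dict mapping each node to a list of adjacent nodes.
    seeds: dict mapping labeled nodes to a fixed score (1.0 = flagged, 0.0 = cleared).
    Returns a dict of scores in [0, 1] for every node.
    """
    # Unlabeled nodes start at an uninformative 0.5.
    scores = {node: seeds.get(node, 0.5) for node in neighbors}
    for _ in range(n_iters):
        updated = {}
        for node, nbrs in neighbors.items():
            if node in seeds or not nbrs:
                # Labeled nodes stay clamped; isolated nodes keep their score.
                updated[node] = scores[node]
            else:
                updated[node] = sum(scores[v] for v in nbrs) / len(nbrs)
        scores = updated
    return scores


# Toy network: two triangles (a, b, c) and (d, e, f) joined by the edge c-d.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e", "f"],
    "e": ["d", "f"],
    "f": ["d", "e"],
}
# One known flagged actor (a) and one known cleared actor (f).
scores = propagate_labels(graph, seeds={"a": 1.0, "f": 0.0})
for node in sorted(scores):
    print(node, round(scores[node], 3))
```

Nodes in the same tightly connected group as the flagged seed end up with scores above 0.5, and nodes near the cleared seed end up below it. Ranking unlabeled nodes by score gives a simple list of candidates to investigate.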
Networks are studied in several places at JHU. Several faculty members in the Department of Applied Mathematics and Statistics study graph theory or graph statistics. Investigators in the departments of Biomedical Engineering and Electrical and Computer Engineering, and at the Applied Physics Laboratory, study networks in various domains, including genetics, transportation networks, and defense applications. Additionally, members of the Human Language Technology Center of Excellence are involved in network science.