Oct 25: Jacob Eisenstein: Sparse Models of Lexical Variation
Text analysis involves building predictive models and discovering latent structures in noisy and high-dimensional data. Document classes, latent topics, and author communities are often distinguished by a small number of trigger words or phrases — needles in a haystack of irrelevant features. In this talk, I describe generative and discriminative techniques for learning sparse models of lexical differences. First, I show how multi-task regression with structured sparsity can identify a small subset of words associated with a range of demographic attributes in social media, yielding new insights about the complex multivariate relationship between demographics and lexical choice. Second, I present SAGE, a novel approach to sparsity in generative models of text, in which we induce sparse deviations from background log probabilities. As a generative model, SAGE can be applied across a range of supervised and unsupervised applications, including classification, topic modeling, and latent variable models.
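To make the SAGE idea concrete, below is a minimal sketch, not the authors' implementation: each class or topic's word log-probabilities are modeled as a corpus-wide background vector m plus an additive deviation eta that is encouraged to be sparse, so that p(w) is proportional to exp(m[w] + eta[w]). The function names, the L1 penalty, and the proximal-gradient (ISTA) fitting loop are illustrative assumptions chosen for brevity; the published SAGE model specifies its own sparsity-inducing prior and optimization procedure.

```python
import numpy as np

# Sketch of the SAGE idea: class-specific word log-probabilities are a dense
# background term m plus a sparse additive deviation eta. Here eta is fit with
# a simple proximal-gradient (ISTA) loop that soft-thresholds it toward zero
# (an illustrative choice, not the estimation scheme used in the SAGE paper).

def soft_threshold(x, lam):
    """Proximal operator of the L1 penalty: shrink x toward zero by lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def fit_sage_deviation(counts, background_logprob, lam=40.0, lr=1e-3, iters=2000):
    """
    counts: length-V vector of word counts for one class or topic.
    background_logprob: length-V vector m of corpus-wide log word probabilities.
    Returns a sparse deviation eta with p(w) proportional to exp(m[w] + eta[w]).
    """
    eta = np.zeros_like(background_logprob)
    n = counts.sum()
    for _ in range(iters):
        logits = background_logprob + eta
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        grad = counts - n * probs                 # multinomial log-likelihood gradient wrt eta
        eta = soft_threshold(eta + lr * grad, lr * lam)  # ascent step + L1 proximal step
    return eta

# Toy usage: with the L1 penalty, most deviations should be driven to exactly
# zero, leaving a handful of "trigger words" with nonzero deviations.
rng = np.random.default_rng(0)
V = 50
background = np.log(np.full(V, 1.0 / V))          # uniform background for illustration
true_eta = np.zeros(V)
true_eta[:3] = 2.0                                # three trigger words
p = np.exp(background + true_eta)
p /= p.sum()
counts = rng.multinomial(5000, p)
eta_hat = fit_sage_deviation(counts, background)
print("nonzero deviations at word indices:", np.flatnonzero(np.abs(eta_hat) > 1e-8))
```

The same additive structure is what lets SAGE plug into classifiers, topic models, and other latent variable models: each component contributes a sparse deviation on top of the shared background rather than a full distribution over the vocabulary.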
BIOGRAPHY