Learning and Natural Language Processing

I am interested in finding provable methods for machine learning problems, applications in natural language processing, and intersections with computational economics and game theory.

Compressed Sensing and Document Embeddings

A common application for low-dimensional document representations is text classification (e.g., via logistic regression or an SVM). However, simple methods such as Bag-of-Words and Bag-of-n-Grams often outperform these distributed embeddings. We use the theory of compressed sensing to prove that, by preserving the information of the simpler methods, LSTM representations can do at least as well on linear text classification as Bag-of-n-Grams. We also find, surprisingly, that given a document represented as a sum of pretrained word embeddings (e.g., GloVe or word2vec), one can recover the document's Bag-of-Words using basis pursuit (the noiseless version of the LASSO estimator).
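The recovery step can be illustrated with a small sketch. It is not the code from the paper: it assumes random Gaussian vectors as a stand-in for pretrained embeddings, a toy vocabulary of 200 words, and SciPy's linprog as the solver. The names basis_pursuit, embeddings, and bow are hypothetical. Basis pursuit (minimize the l1 norm of x subject to Ax = b) is written as a linear program by splitting x into nonnegative parts.

```python
import numpy as np
from scipy.optimize import linprog


def basis_pursuit(A, b):
    """Solve min ||x||_1 subject to A x = b as a linear program."""
    n = A.shape[1]
    # Split x = u - v with u, v >= 0, so that ||x||_1 = sum(u) + sum(v).
    c = np.ones(2 * n)
    A_eq = np.hstack([A, -A])
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n] - res.x[n:]


rng = np.random.default_rng(0)
vocab_size, dim = 200, 60

# Assumption: random Gaussian columns stand in for pretrained word embeddings.
embeddings = rng.standard_normal((dim, vocab_size)) / np.sqrt(dim)

# A toy document: four distinct words with small counts (its Bag-of-Words).
bow = np.zeros(vocab_size)
word_ids = rng.choice(vocab_size, size=4, replace=False)
bow[word_ids] = [3, 1, 2, 1]

# The document embedding is the sum of the embeddings of its words.
doc_vector = embeddings @ bow

# Attempt to recover the sparse count vector from the 60-dimensional sum.
recovered = basis_pursuit(embeddings, doc_vector)
print("max abs error:", np.abs(recovered - bow).max())
```

In this toy setting the document uses far fewer distinct words than the embedding dimension, which is the sparse regime where compressed sensing suggests recovery should succeed. Real pretrained embeddings and longer documents may behave differently.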

Manuscript (OpenReview, Accepted to ICLR 2018)

Joint with Sanjeev Arora, Nikunj Saunshi, and Kiran Vodrahalli.

Self-Annotated Reddit Corpus (SARC)

Using the "/s" sarcasm annotation common in the Reddit community, we gather a large dataset for detecting sarcasm in online comments. The corpus has more than one million comments and can be used for exploratory purposes and to specify tasks for large-scale machine learning in both the balanced and unbalanced label settings.
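As a rough illustration of the self-annotation idea, the sketch below labels a comment as sarcastic when it ends with a standalone "/s" and strips the marker from the text. This is an assumed rule for illustration, not necessarily the filtering used to build the released corpus. The functions label_comment and balanced_sample and the example comments are hypothetical.

```python
import random
import re

# Assumed rule: a trailing "/s" preceded by whitespace (or alone) marks sarcasm.
SARCASM_TAG = re.compile(r"(?:^|\s)/s\s*$")


def label_comment(text):
    """Return (cleaned_text, is_sarcastic) for one raw comment string."""
    match = SARCASM_TAG.search(text)
    if match is None:
        return text.strip(), False
    # Drop the marker so a classifier cannot simply read the label off the text.
    return text[:match.start()].strip(), True


def balanced_sample(labeled, seed=0):
    """Downsample the larger class so both labels appear equally often."""
    sarcastic = [pair for pair in labeled if pair[1]]
    other = [pair for pair in labeled if not pair[1]]
    k = min(len(sarcastic), len(other))
    rng = random.Random(seed)
    return rng.sample(sarcastic, k) + rng.sample(other, k)


# Made-up example comments, not drawn from the corpus.
raw = [
    "Oh sure, that will definitely work /s",
    "The patch fixed the bug.",
    "What a great day to be stuck in traffic. /s",
    "See the docs at example.com/s",
    "Thanks for the link.",
]

labeled = [label_comment(comment) for comment in raw]
balanced = balanced_sample(labeled)
print(len(labeled), "comments in total,", len(balanced), "in the balanced sample")
```

The unbalanced setting corresponds to using the labeled list as it is, where sarcastic comments are typically a small minority. The balanced setting equalizes the two classes, here by simple downsampling.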

Manuscript (arXiv, Accepted to LREC 2018)

Dataset / Code

Joint with Nikunj Saunshi and Kiran Vodrahalli.

Automated WordNet Construction Using Word Embeddings

We develop a method to automatically construct foreign-language WordNets using word embeddings and dictionary learning to augment machine translation. The English WordNet is a crucial tool in natural language processing that documents relations between words as well as other linguistic information. Our approach avoids the costly and time-consuming process of hand-constructing such a database for other languages while maintaining good precision and concept coverage.

Paper (Presented at SENSE@EACL 2017)

Dataset / Code

Joint with Andrej Risteski, Christiane Fellbaum, and Sanjeev Arora.