Compressed Sensing and Document Embeddings
A common application for low-dimensional document representations is text classification (e.g. via logistic regression/SVM). However, simple methods such as Bag-of-Words and Bag-of-n-Grams often outperform these distributed embeddings. We use the theory of compressed sensing to prove that, by preserving the information of the simpler methods, LSTM representations can do at least as well on linear text classification as Bag-of-n-Grams. In addition, we discover the surprising fact that given a document represented as a sum of pretrained word embeddings (e.g. GloVe/word2vec) one can recover the document's Bag-of-Words using basis pursuit (the noiseless version of the LASSO estimator).
Joint with Sanjeev Arora, Nikunj Saunshi, and Kiran Vodrahalli.
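The recovery claim above can be illustrated with a minimal sketch, which is not the code used in this project. It assumes random Gaussian vectors as a stand-in for pretrained embeddings such as GloVe, and the vocabulary size, dimension, and document length are illustrative choices. Basis pursuit (minimize the L1 norm of x subject to Ax = b) is written as a linear program and solved with SciPy's `linprog`.

```python
import numpy as np
from scipy.optimize import linprog


def basis_pursuit(A, b):
    """Solve min ||x||_1 subject to A @ x = b as a linear program."""
    n = A.shape[1]
    # Split x = u - v with u, v >= 0, so that ||x||_1 = sum(u) + sum(v).
    cost = np.ones(2 * n)
    A_eq = np.hstack([A, -A])
    result = linprog(cost, A_eq=A_eq, b_eq=b, bounds=(0, None), method="highs")
    if not result.success:
        raise RuntimeError(result.message)
    return result.x[:n] - result.x[n:]


rng = np.random.default_rng(0)
vocab_size, dim, distinct_words = 200, 60, 5  # illustrative sizes

# ASSUMPTION: random Gaussian columns stand in for pretrained word embeddings.
embeddings = rng.standard_normal((dim, vocab_size)) / np.sqrt(dim)

# A sparse Bag-of-Words vector: a few distinct words with small counts.
bow = np.zeros(vocab_size)
word_ids = rng.choice(vocab_size, size=distinct_words, replace=False)
bow[word_ids] = rng.integers(1, 4, size=distinct_words)

# The document embedding is the sum of its word vectors.
doc_vector = embeddings @ bow

# Recover the word counts from the low-dimensional sum alone.
recovered = basis_pursuit(embeddings, doc_vector)
```

Here the 200-dimensional count vector is compressed to 60 dimensions, and the sparsity of the document is what makes recovery by L1 minimization plausible. With real embeddings the columns are not random, which is why the recovery result is surprising.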
Spot Price Dynamics in Cloud Computing
We introduce a model for spot prices in two-market (spot and on-demand) cloud-computing environments. Using this model, we gain insight into cloud provider behavior and learn the parameters of a nonlinear dynamical system that allows for spot price prediction. These predictions can then be used to inform bidding between instances to reduce the monetary cost of parallelizable jobs.
Joint with Liang Zheng, Andrew Lan, Carlee Joe-Wong, and Mung Chiang.
Self-Annotated Reddit Corpus (SARC)
Using the "/s" sarcasm annotation common in the Reddit community, we gather a large dataset for detecting sarcasm in online comments. The corpus has more than one million comments and can be used for exploratory purposes and to specify tasks for large-scale machine learning in both the balanced and unbalanced label settings.
Joint with Nikunj Saunshi and Kiran Vodrahalli.
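The self-annotation idea can be sketched as follows. This is a hypothetical illustration rather than the corpus's actual filtering pipeline: the marker pattern, the function names `label_comment` and `balanced_sample`, and the downsampling rule are all assumptions.

```python
import random
import re

# ASSUMPTION: a comment is self-annotated as sarcastic if it ends with "/s"
# as a separate token (so a URL ending in ".com/s" does not count).
SARCASM_MARKER = re.compile(r"(?:^|\s)/s\s*$", re.IGNORECASE)


def label_comment(text):
    """Return (text_without_marker, is_sarcastic) for one comment."""
    match = SARCASM_MARKER.search(text)
    if match is None:
        return text.strip(), False
    return text[:match.start()].strip(), True


def balanced_sample(labeled, seed=0):
    """Downsample the larger class so both labels appear equally often."""
    sarcastic = [ex for ex in labeled if ex[1]]
    literal = [ex for ex in labeled if not ex[1]]
    n = min(len(sarcastic), len(literal))
    rng = random.Random(seed)
    return rng.sample(sarcastic, n) + rng.sample(literal, n)


comments = [
    "Great idea, what could go wrong /s",
    "The docs are at example.com/s",
    "I agree with this.",
    "Thanks for the link.",
]
labeled = [label_comment(c) for c in comments]  # unbalanced setting
balanced = balanced_sample(labeled)             # balanced setting
```

The unbalanced list keeps the natural label ratio, in which sarcastic comments are a small minority, while the balanced list trades data for an even split.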
Automated WordNet Construction with Word Vectors
We develop a method to automatically construct foreign-language WordNets using word embeddings and dictionary learning to augment machine translation. The English WordNet is a crucial tool in natural language processing that documents relations between words as well as other linguistic information. Our approach avoids the costly and time-consuming process of hand-constructing such a database for other languages while maintaining good precision and concept coverage.
Joint with Andrej Risteski, Christiane Fellbaum, and Sanjeev Arora.
Multi-Fluid Modeling of Interpenetrating Hohlraum Plasmas
We develop a multi-fluid model for studying the interpenetration of counter-streaming hohlraum plasmas. This problem is crucial to understanding the density build-ups and scattering of laser light in the fusion device that can hamper experiments. However, current single-fluid models are too simple, while particle-in-cell simulations are costly. Our model is qualitatively closer to experimental results and can be extended to arbitrary ion species and various experimental conditions.
Incentive Schemes for Internet Exchange Points
We model incentive schemes for setting up local routing agreements and Internet exchange points in developing countries and examine the use of semi-definite relaxations and simulated annealing approaches to solve the resulting non-convex optimization problems. Since such setups incur high starting costs but greatly improve service, understanding these markets is important to understanding how best to expand Internet coverage.
Joint with Michael Chang; advised by Sanjeev Arora and Nick Feamster.
Particle-in-Cell Simulations for Plasma Propulsion
We investigate the problem of plasma detachment in an expanding magnetic field using a massively parallel particle-in-cell code. While individual charged particles travel parallel to magnetic field lines, high-energy dense plasmas are predicted to detach from them due to turbulence effects. Predicting when and how this occurs is important to understanding the thrust of some space-based propulsion systems, which eject charged particles from magnetic nozzles. Through three-dimensional simulations of the expansion region, we show that this plasma detachment is likely to occur in experimental settings.