Simple Semantic Feature Embeddings

Extending the success of word embeddings to more general semantic features has proven challenging and computationally difficult. We introduce a generic method based on simple linear regression that can embed any language feature given examples of its usage in context. It is also very effective for one-shot and few-shot learning of word embeddings.
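
As a rough illustration of the regression idea, the sketch below fits a linear map from averaged context vectors to known word vectors and then applies that map to the contexts of a new feature. It runs on synthetic data and is not the released code: the names (`true_map`, `context_avgs`, `embed_feature`) and the choice to average context vectors are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_words = 8, 200

# Synthetic stand-ins (hypothetical): in practice these would come from a
# corpus and a set of pretrained word embeddings.
true_map = rng.normal(size=(dim, dim))
context_avgs = rng.normal(size=(n_words, dim))  # u_w: mean vector of the words around w
word_vecs = context_avgs @ true_map.T           # v_w: known embedding of w

# Linear regression: find A minimizing sum_w ||v_w - A u_w||^2.
# lstsq solves context_avgs @ X = word_vecs, so A is the transpose of X.
X, *_ = np.linalg.lstsq(context_avgs, word_vecs, rcond=None)
A = X.T

def embed_feature(contexts, A):
    """Embed a feature from examples of its usage.

    contexts: list of arrays of shape (n_tokens, dim), one per usage example,
    holding the vectors of the words surrounding the feature.
    """
    per_context = [c.mean(axis=0) for c in contexts]
    return A @ np.mean(per_context, axis=0)

# Three usage examples of a new feature, five context words each.
new_contexts = [rng.normal(size=(5, dim)) for _ in range(3)]
feature_vec = embed_feature(new_contexts, A)
```

Because the map is learned once from words that already have embeddings, a new feature needs only a handful of contexts, which is what makes the few-shot setting plausible.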

[Paper] [Slides] [Dataset] [Code] [Corpus Data]

Compressed Sensing and Document Embeddings

A common application for low-dimensional document representations is text classification (e.g. via logistic regression/SVM). However, simple methods such as Bag-of-Words and Bag-of-n-Grams often outperform these distributed embeddings. We use the theory of compressed sensing to prove that, by preserving the information of the simpler methods, LSTM representations can do at least as well on linear text classification as Bag-of-n-Grams. In addition, we find, surprisingly, that given a document represented as a sum of pretrained word embeddings (e.g. GloVe/word2vec), one can recover the document's Bag-of-Words using basis pursuit (the noiseless version of the LASSO estimator).
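
The recovery step can be sketched as a small linear program. The example below is a minimal illustration rather than the released recovery code: random Gaussian vectors stand in for pretrained embeddings, the vocabulary and dimensions are arbitrary, and SciPy's `linprog` is used as the solver.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
dim, vocab_size = 100, 300

# Random vectors standing in for pretrained word embeddings (assumption).
embeddings = rng.normal(size=(dim, vocab_size))

# A short document: Bag-of-Words counts over five distinct words.
bow = np.zeros(vocab_size)
bow[rng.choice(vocab_size, size=5, replace=False)] = [1, 2, 1, 3, 1]
doc_vec = embeddings @ bow  # document = sum of its word embeddings

# Basis pursuit: minimize ||x||_1 subject to embeddings @ x = doc_vec.
# Writing x = pos - neg with pos, neg >= 0 turns this into a linear program.
cost = np.ones(2 * vocab_size)
A_eq = np.hstack([embeddings, -embeddings])
result = linprog(cost, A_eq=A_eq, b_eq=doc_vec, bounds=(0, None), method="highs")
recovered = result.x[:vocab_size] - result.x[vocab_size:]
```

Rounding `recovered` to the nearest integer and comparing it with `bow` shows whether the counts were recovered; with real embeddings, success depends on the document length relative to the embedding dimension.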

[Paper] [Poster] [Embedding Code] [Recovery Code] [Word Vectors]

Joint with Sanjeev Arora, Nikunj Saunshi, and Kiran Vodrahalli.

Spot Price Dynamics in Cloud Computing

We introduce a model for spot prices in two-market (spot and on-demand) cloud-computing environments. Using this model, we gain insight into cloud provider behavior and learn the parameters of a nonlinear dynamical system that allows for spot price prediction. These predictions can then be used to inform bidding between instances to reduce the monetary cost of parallelizable jobs.

[Technical Report] [Slides]

Joint with Liang Zheng, Andrew Lan, Carlee Joe-Wong, and Mung Chiang.

Self-Annotated Reddit Corpus (SARC)

Using the "/s" sarcasm annotation commonly used in the Reddit community, we gather a large dataset for detecting sarcasm in online comments. The corpus has more than one million comments and can be used for exploratory purposes and to specify tasks for large-scale machine learning in both the balanced and unbalanced label settings.
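
To make the self-annotation idea concrete, here is a minimal sketch of labeling a comment by its trailing "/s" marker. The function name and the rule that the marker must be a standalone final token are assumptions for illustration; the actual corpus construction may filter comments differently.

```python
def split_sarcasm_marker(comment: str):
    """Return (text_without_marker, is_sarcastic) for one comment."""
    stripped = comment.rstrip()
    if stripped.endswith("/s"):
        body = stripped[:-2]
        # Only count "/s" when it stands alone, so that URLs or words that
        # merely end in "/s" are not mislabeled (assumed rule).
        if body == "" or body[-1].isspace():
            return body.rstrip(), True
    return stripped, False


labeled = [
    split_sarcasm_marker("Oh great, another Monday /s"),
    split_sarcasm_marker("See https://example.com/s"),
    split_sarcasm_marker("Nice weather today"),
]
```

Removing the marker from the text matters for any downstream classifier, since otherwise the label would leak into the input.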

[Paper] [Dataset] [Code]

Joint with Nikunj Saunshi and Kiran Vodrahalli.

Equilibrium-Seeking Congestion Control

Some recent proposals for congestion control have sought to justify their protocols as converging to the Nash equilibrium of a multi-player partial-information utility game in finite time. We propose a simple, distributed, gradient ascent-based protocol that provably reaches equilibrium in polynomial time and performs well on simulations. 
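
The following toy example illustrates distributed gradient ascent toward a Nash equilibrium; it is not the protocol from this project. The utility function, step size, and link capacity are all hypothetical: each sender i is assumed to have utility log(x_i) - (sum_j x_j)^2 / (2 * capacity), so it needs only its own rate and the aggregate load to compute its gradient.

```python
def gradient_ascent_rates(rates, capacity, step=1.0, iterations=2000):
    """Each sender repeatedly nudges its own rate along its own utility gradient.

    Hypothetical utility for sender i: log(x_i) - (sum_j x_j)**2 / (2 * capacity),
    whose partial derivative in x_i is 1 / x_i - (sum_j x_j) / capacity.
    """
    rates = list(rates)
    for _ in range(iterations):
        total = sum(rates)  # the only shared signal: aggregate load on the link
        # Simultaneous update; rates are kept strictly positive.
        rates = [max(1e-6, x + step * (1.0 / x - total / capacity)) for x in rates]
    return rates


# Four senders starting from unequal rates on a link of capacity 100.
final = gradient_ascent_rates([1.0, 2.0, 8.0, 12.0], capacity=100.0)
```

For this particular utility the equilibrium is symmetric: setting each gradient to zero gives x = sqrt(capacity / n), which is 5 for four senders on a link of capacity 100.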


Joint with Nikunj Saunshi; advised by Jen Rexford.

Please also see the parallel but more complete work of Dong, Meng, et al.

Automated WordNet Construction with Word Vectors

We develop a method to automatically construct foreign-language WordNets using word embeddings and dictionary learning to augment machine translation. The English WordNet is a crucial tool in natural language processing that documents relations between words as well as other linguistic information. Our approach avoids the costly and time-consuming process of hand-constructing such a database for other languages while maintaining good precision and concept coverage.

[Paper] [Poster] [Code] [Dataset]

Joint with Andrej Risteski, Christiane Fellbaum, and Sanjeev Arora.

Multi-Fluid Modeling of Interpenetrating Hohlraum Plasmas

We develop a multi-fluid model for studying the interpenetration of counter-streaming hohlraum plasmas. This problem is crucial to understanding density build-ups and the scattering of laser light in the fusion device that can hamper experiments, but current single-fluid models are too simple while particle-in-cell simulations are costly. Our model is closer qualitatively to experimental results and can be extended to arbitrary ion species and various experimental conditions.


Joint with Dick Berger, Tom Chapman, and Jeff Hittinger.

Incentive Schemes for Internet Exchange Points


We model incentive schemes for setting up local routing agreements and Internet exchange points in developing countries and examine the use of semi-definite relaxations and simulated annealing approaches to solve the resulting non-convex optimization problems. Since such setups incur high starting costs but greatly improve service, understanding these markets is important to understanding how best to expand Internet coverage.


Joint with Michael Chang; advised by Sanjeev Arora and Nick Feamster.

Particle-in-Cell Simulations for Plasma Propulsion

We investigate the problem of plasma detachment in an expanding magnetic field using a massively parallel particle-in-cell code. While individual charged particles travel parallel to magnetic field lines, high-energy dense plasmas are predicted to detach from them due to turbulence effects. Predicting when and how this occurs is important to understanding the thrust of some space-based propulsion systems, which eject charged particles from magnetic nozzles. Through three-dimensional simulations of the expansion region, we show that this plasma detachment is likely to occur in experimental settings.


Advised by Sam Cohen.