Publications

2018
Khodak, Mikhail, et al. “A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors”. To Appear in the Proceedings of the Association for Computational Linguistics (ACL) 2018. Print.
Motivations like domain adaptation, transfer learning, and feature learning have fueled interest in inducing embeddings for rare or unseen words, n-grams, synsets, and other textual features. This paper introduces a la carte embedding, a simple and general alternative to the usual word2vec-based approaches for building such representations that is based upon recent theoretical results for GloVe-like embeddings. Our method relies mainly on a linear transformation that is efficiently learnable using pretrained word vectors and linear regression. This transform is applicable “on the fly” in the future when a new text feature or rare word is encountered, even if only a single usage example is available. We introduce a new dataset showing how the a la carte method requires fewer examples of words in context to learn high-quality embeddings and we obtain state-of-the-art results on a nonce task and some unsupervised document classification tasks.
Arora, Sanjeev, et al. “A Compressed Sensing View of Unsupervised Text Embeddings, Bag-of-n-Grams, and LSTMs”. Proceedings of the 6th International Conference on Learning Representations (ICLR) 2018. Web.
Low-dimensional vector embeddings, computed using LSTMs or simpler techniques, are a popular approach for capturing the “meaning” of text and a form of unsupervised learning useful for downstream tasks. However, their power is not theoretically understood. The current paper derives formal understanding by looking at the subcase of linear embedding schemes. Using the theory of compressed sensing we show that representations combining the constituent word vectors are essentially information-preserving linear measurements of Bag-of-n-Grams (BonG) representations of text. This leads to a new theoretical result about LSTMs: low-dimensional embeddings derived from a low-memory LSTM are provably at least as powerful on classification tasks, up to small error, as a linear classifier over BonG vectors, a result that extensive empirical work has thus far been unable to show. Our experiments support these theoretical findings and establish strong, simple, and unsupervised baselines on standard benchmarks that in some cases are state of the art among word-level methods. We also show a surprising new property of word embeddings such as GloVe and word2vec: they form a good sensing matrix for text that is more efficient than random matrices, a standard sparse recovery tool, which may explain why they lead to better representations in practice.
Khodak, M., N. Saunshi, and K. Vodrahalli. “A Large Self-Annotated Corpus for Sarcasm”. Proceedings of the Language Resources and Evaluation Conference (LREC) 2018. Web.
We introduce the Self-Annotated Reddit Corpus (SARC), a large corpus for sarcasm research and for training and evaluating systems for sarcasm detection. The corpus has 1.3 million sarcastic statements -- 10 times more than any previous dataset -- and many times more instances of non-sarcastic statements, allowing for learning in regimes of both balanced and unbalanced labels. Each statement is furthermore self-annotated -- sarcasm is labeled by the author and not an independent annotator -- and provided with user, topic, and conversation context. We evaluate the corpus for accuracy, compare it to previous related corpora, and provide baselines for the task of sarcasm detection.
Khodak, M., et al. “Learning Cloud Dynamics to Optimize Spot Instance Bidding Strategies”. Proceedings of the International Conference on Computer Communications (INFOCOM) 2018. Print.
As infrastructure-as-a-service clouds become more popular, cloud providers face the complicated problem of maximizing their resource utilization by handling the dynamics of user demand. Auction-based pricing, such as Amazon EC2 spot pricing, provides an option for users to use idle resources at highly reduced yet dynamic prices; under such a pricing scheme, users place bids for cloud resources, and the provider chooses a threshold “spot” price above which bids are admitted. In this paper, we propose a nonlinear dynamical system model for the time-evolution of the spot price as a function of latent states that characterize user demand in the spot and on-demand markets. This model enables us to adaptively predict future spot prices given past spot price observations, allowing us to derive user bidding strategies for heterogeneous cloud resources that minimize the cost to complete a job with negligible probability of interruption. Along the way, the model also yields novel, empirically verifiable insights into cloud provider behavior. We experimentally validate our model and bidding strategy on two months of Amazon EC2 spot price data and find that our proposed bidding strategy is up to 4 times closer to the optimal strategy in hindsight compared to a baseline regression approach while incurring the same negligible probability of interruption.

2017
Khodak, M., et al. “Automated WordNet Construction Using Word Embeddings”. EACL Workshop on Sense, Concept and Entity Representations and their Applications (SENSE) 2017. Web.
We present a fully unsupervised method for automated construction of WordNets based upon recent advances in distributional representations of sentences and word-senses combined with readily available machine translation tools. The approach requires very few linguistic resources and is thus extensible to multiple target languages. To evaluate our method we construct two 600-word test sets for word-to-synset matching in French and Russian using native speakers and evaluate the performance of our method along with several other recent approaches. Our method exceeds the best language-specific and multi-lingual automated WordNets in F-score for both languages. The databases we construct for French and Russian, both languages without large publicly available manually constructed WordNets, will be publicly released along with the test sets.

2015
Khodak, M., et al. “Development and application of a multi-fluid simulation code for modeling interpenetrating plasmas”. 57th Annual Meeting of the APS Division of Plasma Physics (DPP) 2015. Print.
A multi-fluid model, with independent velocities for all species, is developed and implemented for the numerical simulation of the interpenetration of colliding plasmas. The Euler equations for fluid flow, coupled through electron-ion and ion-ion collisional drag terms, thermal equilibration terms, and the electric field, are solved for each ion species with the electrons treated under a quasineutrality assumption. Fourth-order spatial convergence in smooth regions is achieved using flux-conservative iterative time integration and a Weighted Essentially Non-Oscillatory (WENO) finite volume scheme employing an approximate Riemann solver. Analytic solutions of well-known shock tube tests and spectral solutions of the linearized coupled system are used to test the implementation, and the model is further numerically compared to interpenetration experiments such as those of J.S. Ross et al. [Phys. Rev. Lett. 110 145005 (2013)]. This work has applications to laser-plasma interactions, specifically to hohlraum physics, as well as to modeling laboratory experiments of collisionless shocks important in astrophysical plasmas.

2013
Cohen, S., et al. “A method to exhaust energy and ash from small, aneutronic FRC reactors”. Workshop on Exploratory Topics in Plasma and Fusion Research (EPR) 2013. Print.
Power and ash exhaust from magnetic fusion reactors that burn advanced (low-neutron) fuels will be more demanding than for D-T-fueled reactors because a much greater fraction of the fusion products and power are contained within the advanced-fuel-burning plasma. We describe how both fusion power and ash can be effectively removed from the core plasma of an advanced-fuel-burning reactor to its scrape-off layer (SOL) if the reactor is a small field-reversed-configuration (FRC) device. Once the fusion power is in the FRC’s SOL, its linear geometry allows large plasma expansion in remote divertors, thus reducing the peak heat load to acceptable values.
The process relies on fusion-product slowing down in the FRC’s SOL, a rapid process in the FRC’s relatively cool SOL. In a small D-3He-burning FRC reactor, the problematic fusion products are 4He, p, and T. At birth, these ions will have gyroradii about 1/3 as large as the O-point-to-separatrix distance, hence many will pass through the SOL even on their first orbits. We show that once even a small portion of the fast ion trajectory passes through the SOL the ions rapidly lose energy there, in less than 0.1 s, and guiding centers migrate to the SOL. 
We use the FieldCoils code to calculate the magnetic field, a modified version of the RMF code to illustrate the transition of ion trajectories from betatron to figure-8 to cyclotron orbits and to calculate the fraction of fusion products that pass through the SOL, the UEDGE code to calculate the SOL and divertor parameters for a variety of fueling and heating scenarios, and the LSP code to calculate the fast-ion slowing down in the SOL, a novel situation because the Debye length is less than the electron gyroradius and the fast-ion velocity is larger than the electron thermal velocity.