Soc306/SML306: Machine Learning with Social Data: Opportunities and Challenges

Read the 2022 Syllabus

"Machine Learning with Social Data: Opportunities and Challenges" is a joint-listed undergraduate course between the Sociology department and the Center for Statistics and Machine Learning.  This is a class about using the tools of machine learning to study social data. The power of machine learning tools is their applicability for a wide range of tasks. There are huge opportunities for applying these tools to learn and make decisions about real people, but there are also important challenges. This course aims to (1) show social scientists and digital humanities scholars the potential of machine learning to help them learn about humans, make policy and help people while also (2) showing computer scientists how a social science research design perspective can improve their work and give them new outlets for their skills.

This is the second iteration of the course, and my goal is to post course materials here as we move through the Spring 2022 semester.

Course Design:

The course is organized into three thematic modules: (1) the target task, (2) the data source, and (3) guiding values. Each module will be the focus of three to four weeks of classes, but we will talk about all three components throughout. In addition to developing a theme, each week of class will include at least one applied example, a set of opportunities and challenges posed by the theme, and one machine learning technique.

The target task module covers the goal of the analysis: measurement, prediction and causal inference. In this section we start with the sociological impacts of quantification through measurement and reconsider what the numbers in our dataset even mean. We will then consider the distinction between three different types of prediction: prediction where the target looks like the training data (prediction within a population), prediction where the target looks different from the training data (prediction in a new population) and causal inference (prediction to a counterfactual population).

The data source module will consider how the origins of the different kinds of data that one may encounter in academia or industry change the kinds of inference that can be drawn from them. This module picks up from the topic of causal inference and discusses experimental data, including A/B tests as conducted within modern tech companies. We then discuss designed data, such as surveys, where analysts get to control the way data is measured. We will conclude with the setting of the majority of machine learning work, found data, where the data was originally collected for some other purpose. This includes administrative data, digital trace data, electronic health records and many other types of sources.

The final module will focus on a topic that will be addressed broadly throughout the class: guiding values. Picking up from the topic of found data, we start with a focus on the way such datasets can represent a threat to privacy. We discuss potential harms as well as constructive solutions. We then discuss fairness and racial justice, investigating the way that minority populations are often disproportionately harmed by applications of machine learning and algorithmic decision making. We will consider different definitions of fairness and discuss constructive solutions for a more just application of machine learning. The final week will tackle interpretability, discussing why we might want to prioritize the interpretability of the estimated model and when interpretability matters for accountability.


Course Materials:

Coming Soon