On Study Design in Computational Humanities

Causal Inference, Natural Experiments, and the Conditions of Knowledge Production in Humanistic Research

Johan Malmstedt¹
¹Department of English and Comparative Literature, Columbia University, New York NY 10027, USA
Johan Malmstedt, ORCID 0000-0001-9876-5432
* Correspondence: johan.malmstedt@lir.gu.se
Abstract

Reading Thad Dunning's Natural Experiments in the Social Sciences (Cambridge, 2012), this note reflects on the problem of study design in computational humanities research. Dunning's central question — how can causal inference be improved? — proves unexpectedly generative when transposed into humanistic inquiry, where causal and statistical assumptions are often difficult to explicate and defend, let alone validate. This note argues that the natural experiment framework, understood as a design-based method in which control over confounding variables emerges from research-design choices rather than ex post statistical adjustment, offers a productive model for computational approaches to literary and historical corpora.

1. On Causal Inference in Humanistic Research

Reading Thad Dunning’s Natural Experiments in the Social Sciences (Cambridge, 2012), I am particularly struck by his discussion of study design. “How can causal inference be improved?” he asks on page 4, and answers: “In seeking to answer such questions, I place central emphasis on natural experiments as a ‘design-based’ method of research — one in which control over confounding variables comes primarily from research-design choices, rather than ex post adjustment using parametric statistical models.”¹

This approach seems particularly well suited to computational study in the humanities, where the veracity of causal and statistical assumptions is often difficult to explicate and defend, let alone validate. The natural experiment approach seeks to shift reasoning about such assumptions from the statistical modeling phase of research to the design process, where they are expressed in the logic of the design itself. In short, it is the research design, rather than the statistical model, that does the heavy lifting.
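The contrast can be made concrete with a toy simulation. The sketch below is illustrative only and is not drawn from Dunning: the function name `difference_in_means`, the synthetic confounder, and every numeric value are assumptions chosen for the example. It applies the same unadjusted comparison of group means twice, once when assignment is as good as random and once when assignment depends on an unobserved trait.

```python
import random
from statistics import mean


def difference_in_means(n, true_effect, as_if_random, seed=0):
    """Simulate n units and return mean(treated outcomes) - mean(untreated outcomes).

    All quantities are synthetic; nothing here models a real corpus.
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        # Hypothetical unobserved trait that also shapes the outcome.
        confounder = rng.gauss(0, 1)
        if as_if_random:
            # Design-based case: assignment ignores the confounder.
            is_treated = rng.random() < 0.5
        else:
            # Self-selection: units with a high confounder are treated more often.
            is_treated = confounder + rng.gauss(0, 1) > 0
        outcome = 2.0 * confounder + rng.gauss(0, 1)
        if is_treated:
            outcome += true_effect
        (treated if is_treated else control).append(outcome)
    return mean(treated) - mean(control)


# Same estimator, two assignment mechanisms, simulated effect of 1.0.
print("as-if random:", difference_in_means(20000, 1.0, as_if_random=True))
print("self-selected:", difference_in_means(20000, 1.0, as_if_random=False))
```

Under as-if random assignment the unadjusted comparison should land near the simulated effect without any model of the confounder. Under self-selection the same arithmetic absorbs the confounder's influence, which is the situation that ex post statistical adjustment is then asked to repair.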

Figure 1 (image placeholder). Vertical axis: assumption explicitness (low, mid, high). Horizontal axis: research design stage (Hypothesis, Corpus Selection, Method Choice, Statistical Model, Output). Series: design-based inference; model-based inference.
Fig. 1. Schematic comparison of assumption explicitness across research design stages in design-based versus model-based approaches to causal inference.
Visualization: Author. Adapted from Dunning, Natural Experiments in the Social Sciences (Cambridge UP, 2012), Ch. 1.

2. Design as Argument

For this reason, Dunning writes, “substantive and contextual knowledge plays an important role at every stage of natural-experimental research — from discovery to analysis to evaluation.” The emphasis on context necessitates thinking about statistical concepts such as “effect” in ways that are grounded in the specific conditions of the historical or literary problem under study.²

This has immediate implications for how we frame computational humanities projects. Too often, the adoption of statistical or machine-learning methods in literary and historical research is understood as a gesture toward scientific legitimacy — a borrowing of method that preserves the assumption structure of the original social-scientific context while discarding the disciplinary safeguards that made those assumptions defensible.

Consider corpus selection. In standard distant reading practice, the corpus is assembled according to availability, prior canonical judgment, or the outputs of digitization projects, and is then treated as the ground on which analysis proceeds. From a design-based perspective, corpus selection is instead a moment of assumption-making that must be theorized explicitly: what counterfactual is implied by this particular set of texts?³

3. Toward a Design-Based Humanities

The natural experiment model is not, of course, directly transferable to the humanities. Historical and literary materials rarely permit the kind of exogenous variation that defines a true natural experiment in political science or economics. What the framework offers instead is a vocabulary for making explicit the design choices that computational humanities researchers already make implicitly.

A design-based computational humanities would begin not with a corpus and a method but with a research design: a specification of the variation to be exploited, the assumptions required to interpret that variation causally, and the limitations those assumptions impose on the conclusions that can be drawn.⁴
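One way to picture such a specification is as a small record that is filled in before any corpus is assembled. The following is a minimal sketch under stated assumptions: the class `ResearchDesign`, its field names, and the example project are hypothetical, invented here to mirror the three components named above rather than taken from any existing tool or study.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchDesign:
    """Hypothetical record of a design, written down before corpus and method."""

    question: str
    variation: str  # the variation to be exploited
    assumptions: list[str] = field(default_factory=list)  # needed for a causal reading
    limitations: list[str] = field(default_factory=list)  # what the assumptions rule out

    def missing_parts(self) -> list[str]:
        """Name the components of the design that have not been stated yet."""
        missing = []
        if not self.variation.strip():
            missing.append("variation")
        if not self.assumptions:
            missing.append("assumptions")
        if not self.limitations:
            missing.append("limitations")
        return missing


# An invented example project, used only to exercise the record.
design = ResearchDesign(
    question="Did a hypothetical change in censorship rules alter novelistic vocabulary?",
    variation="Texts published shortly before versus shortly after the rule change",
    assumptions=["Timing of publication relative to the change is as good as random"],
)
print(design.missing_parts())
```

The point of the record is not automation but explicitness: a design with an empty list of limitations is flagged as unfinished before any text is counted.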

Lab Notes is precisely the kind of venue in which such methodological reflection can take place: short, process-oriented, and uncoupled from the pressure to present finished results. A note is an appropriate form for an observation that is not yet a finding — for a moment of reading that reorganizes the assumptions of a larger project still in progress.

References
  1. Bode, Katherine. "The Equivalence of 'Close' and 'Distant' Reading." Modern Language Quarterly 78.1 (2017): 77–106.
  2. Dunning, Thad. Natural Experiments in the Social Sciences: A Design-Based Approach. Cambridge: Cambridge University Press, 2012.
  3. Drucker, Johanna. "Humanities Approaches to Graphical Display." Digital Humanities Quarterly 5.1 (2011).
  4. King, Gary, Robert O. Keohane, and Sidney Verba. Designing Social Inquiry. Princeton: Princeton University Press, 1994.
  5. Moretti, Franco. Distant Reading. London: Verso, 2013.
  6. Piper, Andrew. Enumerations: Data and Literary Study. Chicago: University of Chicago Press, 2018.
  7. Underwood, Ted. Distant Horizons: Digital Evidence and Literary Change. Chicago: University of Chicago Press, 2019.
  8. Tenen, Dennis Yi. Plain Text: The Poetics of Computation. Stanford: Stanford University Press, 2017.
  9. Jockers, Matthew L. Macroanalysis: Digital Methods and Literary History. Urbana: University of Illinois Press, 2013.
  10. Ramsay, Stephen. Reading Machines: Toward an Algorithmic Criticism. Urbana: University of Illinois Press, 2011.

Notes

  1. Dunning, Natural Experiments, 4. The emphasis on design echoes earlier methodological discussions in King, Keohane, and Verba's Designing Social Inquiry (1994).
  2. Dunning, Natural Experiments, 7. See also Drucker's critique of quantitative data visualization in humanistic contexts.
  3. The critique of corpus construction in distant reading has been developed by Bode (2017) and others.
  4. Piper's Enumerations (2018) and Underwood's Distant Horizons (2019) both demonstrate attentiveness to research design that goes well beyond method selection.
Published in
Lab Notes
Vol. 1, No. 1 (Winter 2026) · Article No. 4 · pp. 1–6
Publisher
metaLAB at Harvard, Cambridge MA
Publication history
Received 03 Oct 2025
Revised 12 Nov 2025
Accepted 18 Nov 2025
Published 01 Jan 2026
Keywords
  • causal inference
  • study design
  • natural experiment
  • computational humanities
  • research methods
  • distant reading
  • corpus construction
ACM Reference Format
Johan Malmstedt. 2026. On Study Design in Computational Humanities. Lab Notes. Vol. 1, No. 1 (Winter 2026) · Article No. 4 · pp. 1–6. https://doi.org/10.1145/3674158.3674162