Not whether, but where: Scaling-up how we think about effects and relationships in natural educational contexts

Note: This post is a preprint of a paper that will be presented at SASLAS19, a preconference workshop for LAK19. The DOI for this paper is 10.13140/RG.2.2.30825.34407

And so these men of Indostan
Disputed loud and long,
Each in his own opinion
Exceeding stiff and strong,
Though each was partly in the right,
And all were in the wrong!

  • From The Blind Men and the Elephant (Saxe, 1873; p. 260)

Under just the right conditions in natural educational settings, it is possible that any variable could be associated with significant changes, in either direction, for students’ learning outcomes. For example, research into the duration of inactivity in a course site (Conijn et al., 2017), the access of assignments after the deadline (Motz et al., 2019), the order of exemplars during study (Carvalho & Goldstone, 2017), and the immersiveness of instructional examples (Day, Motz, & Goldstone, 2015) has found opposing benefits in different contexts. Whether a researcher observes positive evidence of such an effect, fails to observe a significant effect, or observes the opposite effect may be principally determined by the scope of the researcher’s analysis, and not by whether the effect “exists.” Like the ancient parable of blind men developing opposing theories of a single elephant (e.g., Saxe, 1873), analytical research on student learning risks a similarly absurd dispute about the observation of effects (or lack thereof) in isolated studies, and what these opposing observations might mean.

The goal of this essay is to recommend a shift in thinking about “effects” and “relationships” as observed in authentic educational contexts, moving past thinking of these in binary terms (they are or aren’t observed; or they do or don’t replicate), to thinking of these as existing in varying degrees in different contexts. There is no single relationship between educationally-relevant variables that would hold constant across all learners and learning environments. The question for those analyzing data from authentic educational environments should not be whether such relationships exist, but instead where they exist and to what degree. Furthermore, educational research, including learning analytics, must exist in the context of strong theories and models of learners’ cognition that can predict and explain why these dependencies exist, toward proposals of interventions that can leverage dependencies, instead of being hampered by them.

Any Effect Might Be Present in Some Classroom

Authentic classrooms are not randomly sampled from the space of all possible educational dimensions. Curriculum and course structures are engineered by teachers, administrators, faculty committees, software designers, and textbook publishers to produce positive gains for enrolled students. Rather than being random points in the multidimensional landscape of educational contexts, classrooms are architected learning factories; courses are designed in just such a way so that learning activities, the behaviors of the instructor, the supporting materials, and the surrounding environment all shuttle the enrolled students in the direction of positive learning outcomes. For example, teachers who assign weekly graded practice quizzes are crafting fundamentally different systems than teachers who assign ungraded weekly practice quizzes. The differences between these classes are not limited only to this single dimension of whether the weekly quizzes are credited or not. Both could be reasonably beneficial design solutions in different contexts. Just as the same musical note can elicit different emotions in different chords, any educationally-relevant variable could be inconsequential, or could be engineered to benefit learning, in different classrooms.

When one accepts that classes are not randomly drawn instances from some grand educational roulette wheel, two corollaries follow: (1) Any naturally-occurring variable may be architected in an educational context so as to produce a larger effect on learning outcomes, β, than the same variable’s effect in a different context. And thus (2) the measurement of effect β in an authentic classroom is an interaction between the variable under analysis and the class’s other covariates, not a main effect that should be expected to generalize across contexts.

Let’s consider an example. Imagine that an intrepid team of researchers aims to examine the relationship between some variable, perhaps class attendance, and learning outcomes. They aggregate attendance records and final exam scores for a large course whose data were convenient to access. If the observed effect of attendance on exam performance is 0%, 0.1%, -1% or 10%, what might they claim in these scenarios? Surely these are not generalizable estimates of the effect that attendance could have on learning performance in other classes (what if students had no access to learning materials outside class? Or what if the class activities only involved review of take-home readings?), as was compellingly demonstrated by Gašević et al. (2016). That any particular main effect is observed for any limited sample is rather unremarkable, because the estimate of that effect is determined largely by the context in which it is measured. Indeed, in the context of course design, if a teacher (not the research team) finds that attendance is not related to learning outcomes in the intended way, the teacher might change the relative value of attendance marks, increase active and collaborative problem solving in the classroom, or design other contextual modifications rather than simply conclude that attendance doesn’t “work.” The intrepid research team should also avoid the latter conclusion, which would be a severe out-of-sample overgeneralization.
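To make the attendance example concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the two classes, their sizes, the noise level, and the "true" slopes (30 exam points per unit of attendance in one class, zero in the other) are invented, and the helper names `ols_slope` and `simulate_class` are hypothetical. It simply shows that the same variable can yield very different estimated effects depending on which class, or which mixture of classes, is analyzed.

```python
import random

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def simulate_class(true_slope, n_students, rng):
    """Simulated (attendance rate, exam score) data for one hypothetical class."""
    attendance = [rng.uniform(0.4, 1.0) for _ in range(n_students)]
    scores = [60 + true_slope * a + rng.gauss(0, 5) for a in attendance]
    return attendance, scores

rng = random.Random(2019)  # fixed seed so the sketch is reproducible

# Hypothetical class A: materials are only available in class, so attendance matters.
att_a, exam_a = simulate_class(true_slope=30.0, n_students=300, rng=rng)
# Hypothetical class B: class time only reviews take-home readings.
att_b, exam_b = simulate_class(true_slope=0.0, n_students=300, rng=rng)

slope_a = ols_slope(att_a, exam_a)
slope_b = ols_slope(att_b, exam_b)
# Pooling both classes gives a single "main effect" that describes neither class.
slope_pooled = ols_slope(att_a + att_b, exam_a + exam_b)

print(f"class A slope: {slope_a:.1f}")
print(f"class B slope: {slope_b:.1f}")
print(f"pooled slope:  {slope_pooled:.1f}")
```

In this toy setup the pooled slope lands between the two class-specific slopes, so a team that analyzed only class A, only class B, or the pooled data would report three different "effects of attendance."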

The concerns discussed thus far are sometimes cast as criticisms against the broader research enterprise of mining and analyzing authentic learning data (for a discussion, see Morrison & van der Werf, 2016). But just as the intrepid research team should avoid making overgeneralizations about effects from limited samples, so too should theorists avoid making overgeneralizations about a complex domain of applied research from its youthful foibles. On the one hand, analyses of a relationship in a limited sample could be a very fruitful activity when a teacher seeks to engage in more data-driven design solutions within that precise context (Halverson et al., 2007), or when a limited sample is highly representative of a conventional instructional system that is theoretically interesting or practically relevant, perhaps because of its applicability to specific goals of education (e.g., 9th grade Algebra 1 or Introductory Chemistry recitations as gateways to STEM disciplines). But on the other hand, the broader activities of learning analytics, educational data mining, and other forms of education research utilizing big data could probably benefit from a reconsideration of how effects are analyzed and interpreted (see also Koedinger, Booth, & Klahr, 2013). Such reconsiderations may involve estimating effects separately for different kinds of courses (Motz et al., 2018c, 2019), developing new context-dependent theories of learning (Carvalho, 2018), and expanding the scope of experimental analyses to include a wide pool of independent samples (Motz et al., 2018b).

In the remaining sections, we attempt to motivate these reconsiderations by expanding on the possibility that any effect might be observed in some classroom, and that, as a result, what appear to be main effects are more likely to be interaction effects. We then discuss analytical tools that may scaffold a more robust and scalable perspective on effects and relationships in natural educational contexts.

Any Effect Might Be Observed in Some Classroom

When approaching a big dataset of natural behaviors, such as those increasingly available from e-learning environments, things will get messy. It might be tempting to view a theoretically-interesting effect or relationship as a needle in a haystack, but a more apt perspective might view the effect as a needle in a big stack of needles (which may also include some hay). There is no shortage of possible effects to be “discovered” during the analysis of a natural dataset, leading us to assert that in such a dataset, any effect might be observed (or might not be observed) in some subsample.

Consider the recent work of Silberzahn & Uhlmann (2015; see also Silberzahn et al., 2018), who recruited 29 different research teams to answer a single research question from a single dataset: Are soccer referees more likely to give penalty cards to dark-skin-toned players than light-skin-toned players? The dataset contained the full history of player/referee interactions for over 2,000 professional soccer players in four European countries, as well as the players’ demographics, photos, classification of skin tone (determined by independent raters), and a variety of additional covariates (team, position, etc.). The research analysts submitted their analytical plans (but withheld their provisional results) to a round-robin peer review and subsequently had the opportunity to revise their analyses. Nevertheless, despite this opportunity to converge on analytical approaches, final results varied widely among the participating researchers: effects ranged from 0.89 to 2.93 in odds ratio units (1.0 indicates no effect), with roughly two thirds of teams observing a significant effect, and one third finding no significant effect. The differences in outcomes resulted primarily from whether the analysis was sensitive to covariates and grouping variables present in the data.

While differences in analytical approaches will surely contribute to variability in measured effects, another factor influencing whether relationships are “discovered” is the size of the dataset. With increasing class sizes, and correspondingly increasing sample sizes, p-values are more likely to fall beneath decision thresholds for statistical significance (commonly, the alpha-level), even for spurious results and trivially small effects. For example, when analyzing the characteristics of digital camera auctions on eBay, Lin, Lucas, and Shmueli (2013) found that the magnitude of p-values in their analysis became meaninglessly close to zero when n > 700 (in a dataset containing over 300,000 observations). With a large enough sample, any scant difference is enough to claim statistical significance. In the case where an analyst might contrast two groups, A and B, Tukey (1991) observed, “The effects of A and B are always different - in some decimal place - for any A and B” (p. 100). In this frame, whether someone detects an effect or relationship is really a question of sample size, and these days, behavioral researchers have access to some very large datasets. The observation of an effect is a fundamentally different issue from the relevance of an effect, leading many behavioral scientists toward new statistical standards concerned with effect size rather than effect presence (Serlin & Lapsley, 1985; Cumming, 2014).
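The dependence of significance on sample size can be worked through with a short calculation. The sketch below assumes a two-sample z-test with equal group sizes and unit variance in each group, and a hypothetical standardized difference of 0.02 between groups A and B; these choices are illustrative and are not taken from Lin et al. (2013). The function name `two_sided_p` is likewise made up for this example.

```python
import math

def two_sided_p(d, n_per_group):
    """Two-sided p-value of a two-sample z-test for a standardized mean difference d.

    Assumes equal group sizes and unit variance in both groups, so the
    standard error of the difference in means is sqrt(2 / n_per_group).
    """
    z = abs(d) * math.sqrt(n_per_group / 2)
    return math.erfc(z / math.sqrt(2))

tiny_effect = 0.02  # hypothetical, practically negligible difference between A and B

for n in (100, 10_000, 1_000_000):
    print(f"n per group = {n:>9,}: p = {two_sided_p(tiny_effect, n):.3g}")
```

The effect size is held fixed throughout; only the sample size changes. The same negligible difference is far from significant with 100 observations per group and overwhelmingly "significant" with a million, which is why effect size, not effect presence, is the more informative quantity.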

The possibility of observing an “effect” is not only inflated by large samples and analytical variability; evidence for a spurious effect may also sprout in the soil of atheoretic exploratory analysis (Anderson et al., 2001). The paucity of theory in some applications of learning analytics and educational data mining yields fertile grounds for the discovery of effects and relationships that might be statistically-significant, but have no value for educational practice or for our understanding of educational systems (Wise & Shaffer, 2015). In the future, researchers will find ever-increasing opportunities to “discover” something practically meaningless as institutions continue to develop sprawling data warehouses to support as-yet-undefined future initiatives around learning analytics.

For an analyst who wonders whether an effect can be observed, the answer is surely “Yes.” In the absence of theory, in the presence of large datasets, and without clear methodological standards guiding our analytical plans, we should expect to find anything we want to find from natural educational data. In turn, researchers can benefit from a reconsideration of what is meant by the word “finding” in authentic learning contexts.

Any Effect Might Be an Interaction Effect

Toward the goal of reconsidering what is meant by a “finding,” one useful tack might be to reimagine all main effects in our analyses as being interaction effects within educational systems. For the most part, educational research has embraced the existence of individual differences in education. It is not controversial that different students will approach learning in different ways, and benefit differently from interventions. For example, Steyvers & Benjamin (2018) demonstrated that improvements in online brain training games interact with the learner’s age, and Kalyuga et al. (2003) demonstrated that low-knowledge students benefit more from studying worked examples than high-knowledge students.

However, this embrace of dependencies has not expanded to include the effects that different contexts (i.e., what is learned, how it is learned) have on the effectiveness of the same learning approach (Jonassen, 1982). Carvalho (2018) proposed that if we use learning theory to guide exploration of content-treatment interactions, we can not only gain a deeper understanding of the learning process, but also learn how it can be improved in a general and scalable way. Take, for example, the interleaving effect (see Dunlosky et al., 2013). By using an interaction design approach, Carvalho & Goldstone (2013) were able to demonstrate that the interleaving effect did not generalize to all learning materials. Moreover, and perhaps more importantly, their analyses propose a model of learning over time that can account for content-dependencies and suggest that learning does not always happen by discrimination, leading to clear predictions of when interleaved study will and will not improve learning (Carvalho & Goldstone, 2014; 2017).

Theories that embrace content-dependencies have great potential for learning analytics. If, when approaching a research question, one asks not whether A “works,” but instead whether A differs in context X vs Y, one can learn not only that A works, but also why it works. This is because interactions help us understand the mechanism by which A works: if A works in X but not in Y, what is it about X that makes it work? However, it is important to note that interactions (albeit statistically less likely to be found than main effects, especially with large samples) are not always relevant. Interaction designs should be used with theory-building in mind, and not to dismiss theory by saying “it all depends,” which would be reductio ad absurdum. While every educational effect may depend on a contextual variable, many dependencies are generalizable and are relevant to practice and theory, which is why we advocate for a science that systematically examines where these effects exist.

That any effect might exist in some context, and that these effects are context-dependent, may also be viewed as precipitants of Rossi’s Iron Law of Evaluation (1987): “The expected value of any net impact assessment of any large-scale social program is zero” (p. 4). If an educational intervention’s relationship with learning outcomes is variable across different classes, at large scales the aggregate (net) benefit of an intervention will tend toward zero. Just as analysts ought to think critically about the discovery of an effect, so too should analysts be skeptical when measuring the absence of a reliable main effect at scale. Favorable conditions for an effect are unlikely to be universally-present across large samples of classrooms, and identifying the conditions for an effect’s observation is an important pursuit if we are to make precise predictions about what “works.”
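A small simulation can illustrate how a near-zero net effect at scale can coexist with sizable effects in individual classes. The sketch rests on one strong assumption that the text above does not state explicitly: that class-level effects are centred on zero, with some classes benefiting and others being harmed. The number of classes, the spread of effects, and the 0.2 threshold are all invented for illustration.

```python
import random
import statistics

rng = random.Random(1987)  # fixed seed so the sketch is reproducible

# Hypothetical true effects (in standard-deviation units) for 200 classes,
# assumed to be centred on zero with substantial class-to-class variation.
class_effects = [rng.gauss(0.0, 0.4) for _ in range(200)]

net_effect = statistics.mean(class_effects)
spread = statistics.stdev(class_effects)
# Share of classes whose effect exceeds 0.2 SD in either direction.
share_large = sum(abs(e) > 0.2 for e in class_effects) / len(class_effects)

print(f"net effect across classes: {net_effect:+.2f}")
print(f"SD of class-level effects: {spread:.2f}")
print(f"share of classes with |effect| > 0.2: {share_large:.0%}")
```

Under these assumptions, the aggregate estimate says little about any particular classroom: the informative quantity is the spread of effects and the conditions that predict where a class falls within it.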

Any Effect Might Be Synthesized in Some Classroom

Discussion thus far has been occupied with the discovery of effects during the observation of natural datasets, but another research method bears mentioning: the experimental manipulation of a variable to produce an effect. In laboratory studies, where the setting is artificial and the environmental regime is tightly-controlled according to experimental standards, there may be less risk of variability in outcomes; indeed, laboratory studies are designed precisely so that the observed effects will replicate if all procedures are repeated with a new sample. But when conducting an embedded experiment in an authentic educational context (Motz et al., 2018a), the generalizability of an observed effect is much less certain.

In fairness, it should be noted that effects produced in embedded experiments have important advantages over effects found during the passive observation of natural datasets (Gordon et al., 2018). In particular, in an experiment, the context is held constant across experimental treatments, and only the variables under analysis are manipulated. However, the observed effect in that controlled context still may not be expected to generalize to different classes, because the size of any one measured effect is (as previously discussed) something that interacts with the structure of the class under observation.

Anecdotally, one of us recently discussed the design of an embedded experiment with another researcher, who was considering implementing the experiment in two of his sections during an upcoming semester. The researcher wanted to find a robust effect of his manipulation, so he was examining how he might structure the sections to facilitate this outcome. These considerations included: modifying the syllabus to highlight the experimental variable, emphasizing the variable with a take-home assignment, dedicating class time to a brief discussion of the variable, increasing the weight of grades more closely associated with the variable… At a certain point, we might wonder whether the observation of this effect would require an experiment in the first place! If a class can be architected to facilitate the observation of an effect, why should a researcher bother with the great effort and difficulty of demonstrating that effect?

For an effect observed in one class to be useful and generalizable, that class must be highly representative of a conventional instructional system that is theoretically interesting or practically relevant. Toward this goal, researchers should include documentation of the instructional context wherein an effect is observed. For example, in postsecondary learning environments, at minimum, authors should provide copies of class syllabi to accompany published reports from their embedded experiments, and moreover, they should highlight any course modifications made in support of the experimental contrast. But in keeping with the theme of this essay, rather than examining whether an effect is observed in a specific context, it might be more interesting to cast a wider net, examining where an experimental manipulation has different effects. But what might this “net” look like?

A scalable research model for evaluating experimental effects across a variety of authentic learning contexts, called ManyClasses, is currently under development (Motz et al., 2018b). As with similar efforts in psychology (Many Labs, Klein et al., 2014; Many Babies, Frank et al., 2017), the core feature of ManyClasses is that researchers measure an experimental effect across many independent samples (in this case, across many classes). Rather than conducting an embedded learning study in just one educational context, a ManyClasses study would examine the same experimental contrast in dozens of contexts, spanning a range of courses, institutions, formats, and student populations. By inserting the same experimental manipulation across a diversity of educational implementations, and then analyzing pooled results, researchers can assess the degree to which an effect might yield benefits across a range of specific contexts. In addition to contributing to an estimation of the generalizable effect size of manipulations beyond particular classroom implementations, a ManyClasses study will also systematically investigate how a manipulation might be more or less effective for different students in different situations.
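One common way to analyze pooled results from many independent samples is a random-effects meta-analysis, which estimates both an average effect and the variance of the true effect between samples. The sketch below uses the DerSimonian-Laird estimator as an illustrative stand-in; the source does not say which analysis ManyClasses will use, and the per-class effect estimates and sampling variances here are made up. The function name `random_effects_pool` is hypothetical.

```python
import math

def random_effects_pool(effects, variances):
    """DerSimonian-Laird random-effects pooling of per-class effect estimates.

    Returns (pooled_effect, pooled_se, tau2), where tau2 estimates the
    between-class variance of the true effect.
    """
    k = len(effects)
    if k < 2:
        raise ValueError("need at least two classes to estimate heterogeneity")
    # Fixed-effect (inverse-variance) weights and weighted mean.
    w = [1.0 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Cochran's Q measures how much the estimates disagree beyond sampling error.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Random-effects weights add the between-class variance to each sampling variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    pooled_se = math.sqrt(1.0 / sum(w_star))
    return pooled, pooled_se, tau2

# Made-up effect estimates and sampling variances for four hypothetical classes.
effects = [0.5, -0.3, 0.1, 0.4]
variances = [0.01, 0.01, 0.01, 0.01]

pooled, pooled_se, tau2 = random_effects_pool(effects, variances)
print(f"pooled effect = {pooled:.3f} (SE {pooled_se:.3f}), tau^2 = {tau2:.3f}")
```

The between-class variance `tau2` is the quantity that speaks to the "where, and to what degree" question: a value well above zero signals that the effect differs across classes by more than sampling error alone would explain, which is a cue to look for contextual moderators.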

This ManyClasses model shares common ground with a nascent analytical strategy called a metastudy, also used for analyzing the robustness of an empirical claim across contexts (Baribault et al., 2018). A metastudy involves the radical randomization of experimental design decisions; rather than fixing the study context across conditions (which might include the number of trials, properties of the stimuli, incentives for participating, etc.), these facets are randomly drawn for each observation. In turn, data obtained from a metastudy goes beyond addressing whether an effect exists, to directly estimating the contextual dependencies of the observed effect. By embracing the view that effects will vary across contexts, and directly manipulating and quantifying this variability, researchers can develop a much more complete understanding of the causal chains under analysis.
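The randomization step described above can be sketched in a few lines. This is a toy illustration, not the procedure of Baribault et al. (2018): the facet names and levels (`n_trials`, `incentive`), the simulated outcome, and the assumption that the treatment only helps when course credit is offered are all hypothetical.

```python
import random
from collections import defaultdict

rng = random.Random(2018)  # fixed seed so the sketch is reproducible

# Hypothetical design facets; a metastudy draws these afresh for each observation.
FACETS = {
    "n_trials": [10, 20, 40],
    "incentive": ["none", "credit"],
}

def draw_design():
    """Randomly draw one level for every design facet."""
    return {name: rng.choice(levels) for name, levels in FACETS.items()}

def simulate_outcome(treated, design):
    """Toy outcome: the treatment only helps when credit is offered (an assumption)."""
    effect = 0.6 if design["incentive"] == "credit" else 0.0
    return effect * treated + rng.gauss(0, 1)

# Collect simulated outcomes keyed by (incentive level, treatment arm).
cells = defaultdict(list)
for _ in range(4000):
    design = draw_design()
    treated = rng.randint(0, 1)  # random assignment to control (0) or treatment (1)
    cells[(design["incentive"], treated)].append(simulate_outcome(treated, design))

def mean(values):
    return sum(values) / len(values)

# Estimate the treatment effect separately at each level of the incentive facet.
effect_by_incentive = {
    level: mean(cells[(level, 1)]) - mean(cells[(level, 0)])
    for level in FACETS["incentive"]
}
for level, estimate in effect_by_incentive.items():
    print(f"incentive={level}: estimated treatment effect = {estimate:+.2f}")
```

Because the facets are randomized rather than fixed, the same dataset supports an estimate of the effect at each facet level, which is what allows contextual dependencies to be estimated directly instead of being left as unexamined background.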


So, oft in theologic wars
The disputants, I ween,
Rail on in utter ignorance
Of what each other mean;
And prate about an Elephant
Not one of them has seen!

  • Final stanza from The Blind Men and the Elephant (Saxe, 1873)

Instructional technologists are wont to advertise new teaching and learning tools with the confident certification, “it works!” Data scientists implementing a new technique for predicting academic risk will claim, “our model works!” Psychologists examining students’ studying behaviors in a real class will conclude, “the strategy works!” In response, some skeptical and empirically-minded members of the education research community may scoff, “How do you know?” or “What is your evidence?” But all of these stances seem like non sequiturs, for any such instrument, activity, modeling approach, or strategy might “work” or might fail to “work” in different natural learning contexts. In scaling-up our perspective of these effects, perhaps learning analytics can avoid the dilemma of the blind men and the elephant, by accepting that different observations will necessarily yield different effects and relationships, and that these context-dependencies are theoretically-attractive objects of inquiry. In this paper, we hope to have motivated the view that where an effect exists in a real classroom, and to what degree, are much more meaningful concerns than whether that effect exists.


Anderson, D.R., Burnham, K.P., Gould, W.R., & Cherry, S. (2001). Concerns about finding effects that are actually spurious. Wildlife Society Bulletin, 29(1), 311-316.

Baribault, B., Donkin, C., Little, D. R., Trueblood, J. S., Oravecz, Z., van Ravenzwaaij, D., … & Vandekerckhove, J. (2018). Metastudies for robust tests of theory. Proceedings of the National Academy of Sciences, 115(11), 2607-2612.

Carvalho, P.F. (2018). Understanding the dynamics of learning: The case for studying interactions. In T.T. Rogers, M. Rau, X. Zhu, & C. W. Kalish (Eds.), Proceedings of the 40th Annual Conference of the Cognitive Science Society (pp. 51-52). Cognitive Science Society.

Carvalho, P.F., & Goldstone, R.L. (2013). How to present exemplars of several categories? Interleave during active learning and block during passive learning. In M. Knauff, M. Pauen, N. Sebanz, & I. Wachsmuth (Eds.), Proceedings of the 35th Annual Conference of the Cognitive Science Society. Cognitive Science Society.

Carvalho, P.F. & Goldstone, R.L. (2014). Effects of interleaved and blocked study on delayed test of category learning generalization. Frontiers in Psychology, 5(936), 1-10.

Carvalho, P.F. & Goldstone, R.L. (2017). The most efficient sequence of study depends on the type of test. In G. Gunzelmann, A. Howes, T. Tenbrink, & E. Davelaar (Eds.), Proceedings of the 39th Annual Conference of the Cognitive Science Society (pp. 198-203). Cognitive Science Society.

Conijn, R., Snijders, C., Kleingeld, A., & Matzat, U. (2017). Predicting student performance from LMS data: A comparison of 17 blended courses using Moodle. IEEE Transactions on Learning Technologies, 10(1), 17-29.

Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7-29.

Day, S. B., Motz, B. A., & Goldstone, R. L. (2015). The cognitive costs of context: The effects of concreteness and immersiveness in instructional examples. Frontiers in Psychology, 6, 1876.

Dunlosky, J., Rawson, K.A., Marsh, E.J., Nathan, M.J., & Willingham, D.T. (2013). Improving students’ learning with effective learning techniques. Psychological Science in the Public Interest, 14(1), 4–58.

Frank, M., Bergelson, E., Bergmann, C., Cristia, A., Floccia, C., Gervain, J., Lew-Williams, C., Nazzi, T., Panneton, R., Rabagliati, H., Soderstrom, M., Sullivan, J., Waxman, S., & Yurovsky, D. (2017). A collaborative approach to infant research: Promoting reproducibility, best practices, and theory building. Infancy, 22, 421-435.

Gašević, D., Dawson, S., Rogers, T., & Gasevic, D. (2016). Learning analytics should not promote one size fits all: The effects of instructional conditions in predicting academic success. The Internet and Higher Education, 28, 68-84.

Gordon, B. R., Zettelmeyer, F., Bhargava, N., & Chapsky, D. (2018). A comparison of approaches to advertising measurement: Evidence from big field experiments at Facebook. Forthcoming at Marketing Science.

Halverson, R., Grigg, J., Prichett, R., & Thomas, C. (2007). The new instructional leadership: Creating data-driven instructional systems in school. Journal of School Leadership, 17(2), 159.

Jonassen, D. H. (1982). Aptitude-versus content-treatment interactions. Journal of Instructional Development, 5(4), 15.

Kalyuga, S., Ayres, P., Chandler, P., & Sweller, J. (2003). The expertise reversal effect. Educational Psychologist, 38, 23-31.

Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B., Jr., Bahník, Š., Bernstein, M. J., … Nosek, B. A. (2014). Investigating variation in replicability: A “many labs” replication project. Social Psychology, 45(3), 142-152.

Koedinger, K. R., Booth, J. L., & Klahr, D. (2013). Instructional complexity and the science to constrain it. Science, 342(6161), 935-937.

Lin, M., Lucas Jr., H.C., & Shmueli, G. (2013). Too big to fail: Large samples and the p-value problem. Information Systems Research, 24(4), 906-917.

Morrison, K., & van der Werf, G. (2016). Large-scale data, “wicked problems,” and “what works” for educational policy making. Educational Research and Evaluation, 22(5/6), 255–259.

Motz, B. A., Carvalho, P. F., de Leeuw, J. R., & Goldstone, R. L. (2018a). Embedding experiments: Staking causal inference in authentic educational contexts. Journal of Learning Analytics, 5(2), 47-59.

Motz, B., de Leeuw, J., Carvalho, P., Fyfe, E., & Goldstone, R. (2018b). ManyClasses: A model for abstracting generalizable research principles from different learning contexts. A Workshop on Large Scale Education Replication. Buffalo, New York.

Motz, B., Busey, T., Rickert, M., & Landy, D. (2018c). Finding topics in enrollment data. Proceedings of the 11th International Conference on Educational Data Mining. Buffalo, New York.

Motz, B., Quick, J., Schroeder, N., Zook, J., & Gunkel, M. (2019). The validity and utility of activity logs as a measure of student engagement. In Proceedings of the 9th International Conference on Learning Analytics and Knowledge. ACM.

Rossi, P. (1987). The iron law of evaluation and other metallic rules. Research in Social Problems and Public Policy, 4, 3-20.

Saxe, J.G. (1873). The poems of John Godfrey Saxe. Boston: James R Osgood & Company.

Serlin, R.C., & Lapsley, D.K. (1985). Rationality in psychological research: The good-enough principle. American Psychologist, 40(1), 73-83.

Silberzahn, R. & Uhlmann, E.L. (2015). Crowdsourced research: Many hands make tight work. Nature, 526(7572), 189-191.

Silberzahn, R., Uhlmann, E.L., Martin, D.P., Anselmi, P., Aust, F., Awtrey, E.C., … Nosek, B.A. (2018). Many analysts, one dataset: Making transparent how variations in analytical choices affect results (preprint). PsyArXiv.

Steyvers, M. & Benjamin, A.S. (2018). The joint contribution of participation and performance to learning functions: Exploring the effects of age in large-scale data sets. Behavior Research Methods.

Tukey, J. (1991). The philosophy of multiple comparisons. Statistical Science, 6(1), 100–116.

Wise, A. F., & Shaffer, D. W. (2015). Why theory matters more than ever in the age of big data. Journal of Learning Analytics, 2(2), 5-13.