Methods & Meta-science

Between-group matching of control variables: Why covariates remain important for analysis

Experimental designs are often within-subjects and between-items, or vice versa. For example, in lexical decision experiments, conditions frequently involve different types of words (a between-item manipulation), all of which are presented to each participant (a within-subject manipulation). Such designs almost always pose the problem of controlling for confounds (e.g., lexical frequency). Ideally, the different groups of items should not differ on such control variables. However, since a ‘perfect match’ is often hard to achieve, researchers typically rely on a significance test at the item level (e.g., do Type-A words differ from Type-B words in lexical frequency?), assuming that a non-significant difference (say, p > .1) lets them “rest assured” that the confounding variable is unlikely to explain any effects of experimental condition (e.g., type of word). I will show that one actually cannot “rest assured” in this case, and that what is currently perhaps the most common practice for dealing with between-group confounds in such designs is flawed: a significance test at the item level does not take into account that the same set of items is repeatedly presented to many participants. This repetition increases the effective sample size (and therefore power) for any cross-condition bias in the confounding variable, and hence for any influence such a bias may have on the experimental results. To illustrate this, I will present results from a series of Monte Carlo simulations. Each simulated within-subject/between-item experiment had two conditions (20 items each) and up to 30 “subjects”. The two groups of items always differed non-significantly (p ranging from .10 to .97; t-test) on a randomly determined covariate with a known linear effect on the dependent variable of interest.
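To make the effective-sample-size point concrete, here is a minimal Monte Carlo sketch in Python (numpy/scipy). All parameter values and the deliberately simplified generative model (trial-level noise only, no by-subject or by-item random intercepts) are illustrative assumptions, not the talk's actual simulation settings. Two item groups are redrawn until they pass the conventional matching check (item-level t-test, p > .1) on a covariate that linearly affects the dependent variable; with no true condition effect at all, the by-subject ("F1") test nonetheless rejects far more often than the nominal 5%, because every subject responds to the same slightly biased item sets:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative settings (assumptions, not the talk's exact parameters):
N_ITEMS, N_SUBJ = 20, 30   # items per condition, number of subjects
SLOPE, BIAS = 0.5, 0.3     # covariate effect on the DV; true group bias

def simulate_f1_p(rng):
    """One within-subject/between-item experiment with NO true condition
    effect; the DV depends only on the covariate plus trial noise."""
    # Redraw item covariates until the groups count as 'matched'
    # by the conventional criterion (item-level t-test, p > .1).
    while True:
        cov_a = rng.normal(0.0, 1.0, N_ITEMS)
        cov_b = rng.normal(BIAS, 1.0, N_ITEMS)
        if stats.ttest_ind(cov_a, cov_b).pvalue > .1:
            break
    cov = np.concatenate([cov_a, cov_b])
    # Every subject responds to the same 2 * N_ITEMS items.
    dv = SLOPE * cov[None, :] + rng.normal(0, 1, (N_SUBJ, 2 * N_ITEMS))
    # By-subject ("F1") analysis: paired t-test on per-subject
    # condition means.
    return stats.ttest_rel(dv[:, :N_ITEMS].mean(axis=1),
                           dv[:, N_ITEMS:].mean(axis=1)).pvalue

p_vals = np.array([simulate_f1_p(rng) for _ in range(200)])
print(f"F1 Type I error rate despite 'matched' items: "
      f"{(p_vals < .05).mean():.2f}")   # well above the nominal .05
```

The item-level matching test has only 20 observations per group, so modest covariate biases survive it; the F1 test then aggregates that same fixed bias over all 30 subjects, which is why the false-positive rate climbs with the number of subjects.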
The generative model added various sources of random variation at the subject, item, and trial level and assumed either no effect of experimental condition (to estimate the Type I error rate) or a condition effect in the same or the opposite direction as the bias in the covariate (to estimate power). Data analyses were based on by-subject (“F1”) AN(C)OVA and maximal Linear Mixed Effects Models (LMEMs), each time either including or excluding the confounding variable as a covariate. The results were clear. Ignoring the covariate led to unacceptably high Type I error rates in F1, with more subjects leading to increasing anticonservativity; conversely, power was greatly reduced for condition effects that were masked by an opposing bias in the covariate. Including the covariate in the analysis alleviated both problems, though not completely, because F1 does not account for the additional by-item random intercept variance in the generative model. Maximal LMEMs without the covariate were too conservative when condition effects were suppressed by an opposing bias in the covariate (their performance generally suffered from conflating covariate-related item variance with the additional by-item random intercept variance in the generative model). Maximal LMEMs including the covariate were near-optimal in terms of both Type I error rate and power. To conclude, demonstrating a non-significant bias at the item level is of little use when dealing with confounding variables such as lexical frequency, particularly when the same set of items is presented to many participants. Indeed, such control variables should always be included in the analysis model. Note that the labels ‘subject’ and ‘item’ are fully interchangeable: the exact same problem occurs when groups of subjects (assumed to be matched on some person-specific confound) are repeatedly tested over many items or trials. Slides, data, and code: