Methods & Meta-science

Third time lucky

I would like to discuss data from an experiment designed to investigate perceptual mergers in Russian and German. The “universal” hypothesis predicted incomplete mergers and unified cues; the “unique” hypothesis predicted variable degrees of merger and language-specific cues. Listeners (20 per language) rated a series of synthetic stimuli on a 9-point scale, judging the goodness of match between the pitch of the stimulus and a linguistic context. In each language, 18 stimuli (2×3×3 design) were paired with two contexts. Reviewers of the manuscript (now in its third round of revisions) have made conflicting statistical recommendations, so I am looking for independent expert advice. Specifically, there are two points on which I would like Statistics Tea Tasters' feedback:

(1) I was asked to replace my repeated-measures ANOVA with linear mixed-effects models (LMEMs). However, the reviewers disagree on whether to include models with language as a group factor, given that there were some differences between the Russian and German stimuli.

(2) In both languages, two factors were highly significant. To argue that the cues are language-specific, I fitted Gaussians to the response distributions on these two factors for each language and calculated the amount of overlap between the density functions. The results support the hypothesis, but is this a legitimate approach to categorical data? Is there a more appropriate alternative, for example one that tests effect size?
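To make the overlap calculation in point (2) concrete for discussion, here is a minimal Python sketch of one way it could be done. It is not the analysis from the manuscript: the ratings are made-up illustrative numbers, the helper name `gaussian_overlap` is hypothetical, and the overlap is computed with the standard library's `statistics.NormalDist.overlap`, which returns the overlapping coefficient of two normal densities. Note that the sketch treats the 9-point ratings as continuous, which is exactly the assumption being questioned.

```python
from statistics import NormalDist

# Hypothetical 9-point ratings for one factor level, pooled over listeners.
# These are invented numbers for illustration, not data from the experiment.
russian = [2, 3, 3, 4, 4, 4, 5, 5, 6]
german = [4, 5, 5, 6, 6, 6, 7, 7, 8]


def gaussian_overlap(a, b):
    """Fit a Gaussian to each sample and return the overlapping coefficient.

    The result lies between 0 (no shared area under the two fitted
    density curves) and 1 (identical fitted curves).
    """
    fit_a = NormalDist.from_samples(a)  # sample mean and standard deviation
    fit_b = NormalDist.from_samples(b)
    return fit_a.overlap(fit_b)


ovl = gaussian_overlap(russian, german)
print(f"Overlap between fitted Gaussians: {ovl:.3f}")
```

A small overlap would indicate that the two languages' fitted response distributions are well separated on that factor; the sketch says nothing about whether a Gaussian is an adequate description of ratings on a bounded ordinal scale.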