Last changed 27 Dec 2022 ............... Length about 4,000 words (32,000 bytes).
(Document started on 25 Mar 2011.) This is a WWW document maintained by Steve Draper, installed at http://www.psy.gla.ac.uk/~steve/best/effect.html. You may copy it. How to refer to it.


Effect size

By Steve Draper, Department of Psychology, University of Glasgow.

Preface

This page is mainly about "effect size", which is a concept that tries to remedy some of the deficiencies of just doing significance testing.

There is an emerging argument that effect size does not solve all those deficiencies: see the last section "Beyond effect sizes".

Main part of this page

This page is about effect size (ES): what it is, seen from a slightly wider perspective than just statistics. If you just want the statistical view and tests, the Wikipedia page seems good.
Also see Mike Clark's slides on effect sizes (PDF; local copy).

Statistical tests such as a t-test aim to tell you how much certainty to attach to the possibility that a difference is not an accident but an effect. Another important question is "How important is that difference (if real)?". I shall use the term "effect size" as a general title for this whole question; treat all the statistical measures of effect size as one topic within it; and treat the issue of which of the alternative measures (e.g. "Cohen's d") is best as a subtopic.

Bloom84 gives a powerful argument about how to use effect sizes in planning a programme of applied research.

The main point of measuring effect sizes is to compare the relative importance of different effects or phenomena. An example from psychology illustrates this by comparing established gender effects: some are really big, but many others are of small, even negligible, practical importance.

Comparing effect sizes: getting a sense of the range

A table of gender effects in psychology published by New Scientist is very helpful in demonstrating how very different in magnitude effects can be.

The table would be even better if it used the sections below to bring out predictions from a specific effect size, e.g. the CLES, which tells you the chance that, if you pick one random male and one random female, they differ on a given trait in the predicted direction. I.e. it would be good to add a third column showing the odds that one male and one female, taken at random, differ on the trait in the same direction as the difference in the means for that trait.

New Sci, Gender effects table

Explanation of the table below: Diff. between males and females (effect size in StdDev units). Colour of the number shows which sex does better. There are 6 more items in (the bottom of) the original table.

TRAIT	Effect size	Odds of one random pair showing the effect
Gender identity	11.0 - 13.2	-
Sexual orientation	6.0 - 7.0	-
Preference for boy's toys	2.1	0.92
Height	2	0.92
Preference for girl's toys	1.8	0.9
Physical aggression	0.4 - 1.3	0.72
Empathy	0.3 - 1.3	0.71
Fine motor skills	0.5 - 0.6	0.65
Mental rotation	0.3 - 0.9	0.66
Assertiveness	0.2 - 0.8	0.64

Alternative stats measures of effect size

Use r (correlation coefficient) as the measure: according to Open Science Collaboration (2015) "Estimating the reproducibility of psychological science" Science vol.349 Issue 6251 pp.910-911 doi:10.1126/science.aac4716

Cohen94 suggests using confidence intervals.

Rough ideas on what size effects are small / medium / large

	Micro	Small	Medium	Large	Huge or Very big	Gigantic
Cases: → Measures ↓	Time of day effects on school tests [9]	Difference between the heights of 15 and 16 year old girls in the USA [1]	M vs. F verbal fluency [2]	Difference between the heights of 13 and 18 year old girls in the USA [3]	Female vs. male height [4]; learning in school class vs. 1:1 tutoring [5]	Biological sex affects gender identity [6]
Cohen's d	e.g. ≤ 0.03	0.2 - 0.3	≈ 0.5	0.8 or larger	2	12
[7] r (correlation coeff.) [11]	-	0.1 - 0.3 (≈ 0.1)	0.3 - 0.5 (≈ 0.3)	0.5 (≈ 0.5)	-	-
R²: % of variance explained	-	1%	9%	25%	-	-
R²: (≈ ≈ Partial eta-squared)	-	0.02	0.13	0.26	-	-
Generalized eta-squared ($\eta_G^2$) [8]	-	0.02	0.13	0.26	-	-
Partial eta-squared ($\eta_p^2$)	-	0.01	0.06	0.14	-	-
Odds ratio	-	-	-	-	-	-
Confidence interval	-	-	-	-	-	-

Lit. refs for the table above

Mike Clark's slides on effect sizes: PDF on his web site (PDF local copy).
New Scientist table of gender-sex effects: Spinney, Laura (2011) "Boy brain, girl brain: How the sexes act differently" New Scientist, no.2802, 8 March 2011 http://www.newscientist.com/article/mg20928021.400-boy-brain-girl-brain-how-the-sexes-act-differently.html
See [1]
See [2]
Bloom84 mentions it.
See [2]
Cohen, J. (1988) Statistical power analysis for the behavioral sciences (2nd ed.) (Hillsdale, NJ: Erlbaum)
Also: Field,Andy (2013) Discovering Statistics using IBM SPSS 4th edition p.267
Cohen, J. (1988) cited above: p.286ff. for some of the effect sizes e.g. for eta-squared.
Bakeman, R. (2005) "Recommended effect size statistics for repeated measures designs" Behavior Research Methods vol.37 no.3 pp.379-384 doi:10.3758/BF03192707; criticising Cohen (1988) (see [7]) pp.413-414.
Sievertsen et al. 2016
Cohen,J. (1992) "A power primer" Psychological Bulletin Vol.112. No.1 pp.155-159 doi:10.1037/0033-2909.112.1.155 In this paper, Cohen expresses small, medium and large effect sizes, not as a range of values, but in terms of a central value for each category.

Other notable cases include:

Very big. Particle physicists require an effect of 5 standard deviations ("5 sigma"; between "Very Big" and "Gigantic" if read as a Cohen's d) before declaring that the discovery of a new particle (e.g. the Higgs boson) has been established (wikipedia entry).
Very small. The paper by Sievertsen et al. 2016, on how the time of day at which a child takes a school test affects the result, has a huge dataset, impeccable stats, and goes the extra mile by providing information to help judge the data against relevant contextual standards. But the effect sizes are still minute, calling for a new category of "micro": it reports a variety of (Cohen's d) effect sizes in the range 0.005 - 0.03.

Different meanings for "Important", and so for "effect size"

The basic idea is that significant differences at best only tell you how likely it is that something is a true effect, but we should be asking how important a finding is; and "effect size" is meant to be a measure of this.

But there are different senses of "important" for effects.

Some different senses of "how important" (A plan for future subsections)

First: senses of the basic 'how important is this effect'?
Contextual issues: what comparisons in each given context are relevant in interpreting whether a given shift (in the mean) is important.
1. Comparing to what other effects / causes? Even if the experiment did not directly compare to some reference condition, this may be what most readers would find meaningful e.g. comparing a drug to "treatment as usual"; comparing an anti-depressant to regular exercise (now surely the most relevant comparison).
2. Comparing in what units? e.g. in education, giving an effect size in statistical units may not be as meaningful as in an external unit such as grades e.g. "the intervention increased learners' grades on average by 0.5 (or by 2) grades".
3. Comparing in what units? e.g. in education, giving an effect size in statistical units may not be as natural as reporting in an intrinsic unit e.g. the absolute amount learned (normalised gain) in the experimental group was 2 (or in fact 3) times that learned in the control group (Crouch & Mazur 2001).
What proportion of the sample (or population) did the effect apply to?
Somewhere have a section on really big and really small specifically educational effects (including chick-sexing).
Cost-effectiveness. In almost any practical context it isn't just an issue of how much good you do, but also of how cheaply it can be done — because with a given amount of resource, you can do more good to more people by a cheap effect than by an expensive one.
Unexpectedness w.r.t. theory

What proportion of the sample (or population) did the effect apply to? [C]

Guillaume Rousselet in this blogpost raises this point. If the variability in the data is just "noise" in the measurement process, then the effect is actually true of all participants, but measurement inaccuracies have blurred this; and "effect size" is just a metric of measurement quality. However if the variability is in the mechanism, and particularly in the common but very important case of it being in the participants, then it is of great interest e.g. a drug works on some people, but not others (e.g. because of differing genetic heritage, or differing lifestyle protective factors, etc.). This is common in looking at the effects of treatments or interventions; but equally important for cases of "natural immunity" where for some people, exposure doesn't lead to illness. This applies in many areas not just disease. E.g. In Milgram's experiment, most did what the man in the white coat said despite their qualms, but (just as importantly in Milgram's view) 15% were "immune" and did not. Similarly priming and expectation effects may well work on some and not on others: which is quite different from their working uniformly but weakly on all. Or in positive psychology, gratitude exercises reliably raise well-being in most, yet strongly religious or spiritual individuals seem "immune".

In these cases, it is of great interest to report what proportion of the sample the effect applies to. You may or may not agree with Rousselet's labelling of this as a type of "effect size", but perhaps you should in any case consider doing this routinely. Reporting the percentage is useful; a sign test gives the probability of such a split arising by chance.

Note that in this case, really the tacit hypothesis is that the distribution is not normal, but bimodal: a population for which the effect "works", and another for which it doesn't.
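As a rough sketch of the sign test mentioned above: count how many participants moved in the expected direction and how many moved the other way, then ask how likely so lopsided a split would be if each direction were equally likely. The function name and the example counts (15 of 20) are my own illustration, not data from any study, and participants showing no change are assumed to have been dropped before counting.

```python
from math import comb


def sign_test_p(n_positive: int, n_negative: int) -> float:
    """Two-sided exact sign test.

    Returns the probability of a split at least this lopsided if each
    participant were equally likely to move in either direction.
    """
    n = n_positive + n_negative        # ties assumed already removed
    k = max(n_positive, n_negative)    # size of the larger side
    # One-sided tail: chance of k or more out of n on a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)        # double it for a two-sided test


# Hypothetical example: the effect showed up in 15 of 20 participants.
share = 15 / 20
p = sign_test_p(15, 5)
print(f"{share:.0%} showed the effect; sign test p = {p:.3f}")
```

Reporting both numbers together (the percentage and the p-value) keeps the "how many did it work for" question separate from the "could this split be chance" question.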

Basic approach: StdDev units

If you have a result with a strong p-value and good effect size, then what does it predict specifically? This section is to explore the arithmetic of the normal distribution to spell this out.

From the wikiP page on the Normal distribution;
Section on standard deviation and tolerance intervals
As mentioned above, elementary particle physicists use the convention of only declaring a discovery when the effect is 5 SDevs (or more): 99.99994% of the data is within 5 SDevs of the mean (so the chance of observing such a data set by chance is about 1 in 1,700,000, or nearly one in two million).
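The figures just quoted can be checked with a few lines of arithmetic on the standard normal distribution. This is only a sketch: the function names are mine, and it assumes the two-tailed reading (data within k SDevs either side of the mean).

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, SDev 1


def fraction_within(k: float) -> float:
    """Fraction of a normal distribution within k SDevs of the mean (two-tailed)."""
    return STD_NORMAL.cdf(k) - STD_NORMAL.cdf(-k)


def one_in_n_outside(k: float) -> float:
    """Express the chance of falling outside +/- k SDevs as '1 in N'."""
    return 1.0 / (1.0 - fraction_within(k))


for k in (1, 2, 3, 5):
    print(f"{k} SDev: {100 * fraction_within(k):.5f}% inside, "
          f"about 1 in {one_in_n_outside(k):,.0f} outside")
```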

Given a specific ES, show what % of the old data would be passed.
The table above is 2-tailed. But with a proven effect with an effect size of 2 (for example) we probably want to know how much of the original data is above 2 stdDevs (one-tailed). To calculate this:
- Go to this online calculator;
- (Leave the mean set to 0, the SDev to 1, and the radio button "Above" selected;)
- Fill in "2" in the box beside "Above", and click "Recalculate".
It will show the "Area" as 0.0228, which means 2.28% of the data is more than 2 SDevs greater than the mean.
So if you applied an intervention with an effect size of +2, then without it only 2.28% of a sample score that highly or better, but with it 50% will score that highly.
If I took one participant from the midpoint (mean = median) of the old distribution and applied the intervention to her/him alone, where would we expect them to move to relative to the old distribution? That is, they were at percentile 50; which percentile would they move to? [E.g. ES = 2: percentile 50 --> 98, i.e. rank 50 --> 2nd out of 100.] [E.g. ES = 1: percentile 50 --> 84, i.e. rank 50 --> 16th out of 100.]
[Extreme weather events] Given a (small) shift in the mean, what does this do to the frequency of data points up at the extreme of a distribution? E.g. 100-year storms.
Odds, in gender effects, of randomly picking 1 M, then 1 F and having the difference on the measure be in the expected direction. Or not. Equally for any effect, pick one value from control group and one from expt group: what chance of these two data being in the expected order.
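The online-calculator steps and the extreme-events question above come down to the same one-tailed calculation, sketched here. Assumptions of mine: scores are normally distributed in SDev units, the intervention shifts the mean without changing the spread, and the "rare event" example (a threshold exceeded 1% of the time, then a shift of 0.5 SDev) is purely hypothetical. Real storm data need not be normal.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # the old distribution in SDev units


def fraction_above(threshold_sd: float, shift_sd: float = 0.0) -> float:
    """Fraction scoring above threshold_sd (on the old scale)
    once the mean has moved up by shift_sd."""
    return 1.0 - STD_NORMAL.cdf(threshold_sd - shift_sd)


# Effect size +2: share of scores above the old "+2 SDev" mark.
before = fraction_above(2.0)               # about 0.0228, as the calculator shows
after = fraction_above(2.0, shift_sd=2.0)  # half the shifted distribution

# Extreme events: a level exceeded 1% of the time, then a small shift in the mean.
rare_level = STD_NORMAL.inv_cdf(0.99)
ratio = fraction_above(rare_level, shift_sd=0.5) / fraction_above(rare_level)

print(f"before: {before:.4f}, after: {after:.2f}")
print(f"a 0.5 SDev shift makes the 1% event about {ratio:.1f} times as frequent")
```

The point of the second calculation is that a shift that looks small at the mean can multiply the frequency of events far out in the tail.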

Then say:
How much in percentiles does a given ES mean?
Or for A vs. B: for a given ES, what are the odds that a randomly chosen A and a randomly chosen B differ in the direction of the overall effect?
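For the percentile question, a minimal sketch, assuming normally distributed scores and an intervention that simply adds the effect size (in SDev units) to an individual's score. The function name is mine.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # the old distribution in SDev units


def new_percentile(effect_size: float, old_percentile: float = 50.0) -> float:
    """Percentile (within the OLD distribution) reached by someone who
    started at old_percentile and then gained effect_size SDevs."""
    z_old = STD_NORMAL.inv_cdf(old_percentile / 100.0)
    return 100.0 * STD_NORMAL.cdf(z_old + effect_size)


for es in (0.2, 0.5, 0.8, 1.0, 2.0):
    print(f"ES = {es}: percentile 50 --> {new_percentile(es):.0f}")
```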

CLES: chances of a random pair showing the effect

CLES = common language effect size. It is the chance that, on picking one random individual from each of the control and experimental groups, the difference between those two individuals is in the same direction as the difference between the group means. See Coe: It's the Effect Size, Stupid.
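A sketch of how a CLES can be derived from Cohen's d, under assumptions that are mine rather than anything stated above: both groups are normally distributed with the same SDev. The difference between two random individuals then has mean d and SDev sqrt(2) (in SDev units), so the chance that it is positive is the normal CDF at d / sqrt(2).

```python
from math import sqrt
from statistics import NormalDist


def cles_from_d(d: float) -> float:
    """Chance that a random member of the higher-scoring group beats a
    random member of the other group, assuming normality and equal SDevs."""
    return NormalDist().cdf(d / sqrt(2.0))


for d in (0.2, 0.5, 0.8, 2.0):
    print(f"d = {d}: CLES = {cles_from_d(d):.2f}")
```

With d = 0 this gives 0.5 (a coin toss), which is the natural baseline to compare any CLES against.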

How much of the variance is explained
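A minimal sketch of the arithmetic behind "variance explained": squaring a correlation r gives the proportion of variance explained. The conversion from Cohen's d to r used here assumes two groups of equal size; it is offered only as one common way of linking the two measures, and the function names are mine.

```python
from math import sqrt


def variance_explained(r: float) -> float:
    """Proportion of variance explained by a correlation of r."""
    return r * r


def r_from_d(d: float) -> float:
    """Convert Cohen's d to a correlation r, assuming two equal-sized groups."""
    return d / sqrt(d * d + 4.0)


for r in (0.1, 0.3, 0.5):
    print(f"r = {r}: {100 * variance_explained(r):.0f}% of variance explained")

for d in (0.2, 0.5, 0.8, 2.0):
    r = r_from_d(d)
    print(f"d = {d}: r = {r:.2f}, {100 * variance_explained(r):.0f}% of variance explained")
```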

Other sections / issues about effect size

Cases with small stats effect size, yet importance of other kinds

Sievertsen

Sievertsen claims that his effect is ethically important. This would be an example of another kind of importance. However, his particular claim is false: to spend money on levelling a minute injustice is to fail to spend it on more important educational injustices such as parental wealth, parental support of education in the home through their attitudes, etc.

Carrying plants out of hospital wards at night, back again during the day.

Unexpectedness w.r.t. theory [F]

Lecture theatre seating position (Perkins & Wieman, 2005). Statistically significant (just about); low priority in practice; but important theoretically because it is unexpected and we have no good explanation for this (small) effect.

I.e. if you want to do immediate practical good to learners, then a low ES means it isn't a good investment. However, from the viewpoint of theoretical rather than applied research, the more unexpected an effect, the more valuable it is regardless of effect size. Put another way: a datum which is highly expected has little information value, but one which is highly unexpected (if it can be trusted) has very high information content. Put still another way: for theory-directed research especially, the importance of an effect depends not only on the size of the difference between an actual observation and a theory-predicted observation, but also on the confidence in the surprising measure and the confidence in the prediction.

This is also a way to begin to think about how to reason about the relative value of different bits of qualitative research.

Cost effectiveness [E]

Aveyard and the 30 sec. GP consultation. Contains a CE analysis; and importance depends strongly on this. Not a very large effect, but well worth the money for the effect. Similarly the Open University strengths intervention.

Aveyard et al. (2016) "Screening and brief intervention for obesity in primary care: a parallel, two-arm, randomised trial" The Lancet doi:10.1016/S0140-6736(16)31893-1

Wright Brothers

In Engineering, there are cases where a single design was so obviously superior to all previous ones that statistics are irrelevant, and in fact silly. In such cases everyone following imitates most features of that design. And they do not wait cautiously for evidence about its success: the first demonstration is convincing, and those who wait are simply the ones who don't become contributors to the field. The Wright Brothers aircraft, and Stephenson's Rocket are 2 such examples.

"They made the first controlled, sustained flight of a powered, heavier-than-air aircraft on December 17, 1903, four miles south of Kitty Hawk, North Carolina, USA." "The brothers' fundamental breakthrough was their invention of three-axis control, which enabled the pilot to steer the aircraft effectively and to maintain its equilibrium. This method became and remains standard on fixed-wing aircraft of all kinds."

By 1909? the modern aircraft had arrived. By 1910 it had been used usefully by the military (both reconnaissance, and dropping bombs).

Robert Stephenson's "Rocket" steam railway locomotive (1829):

Single pair of driving wheels.
* Multiple fire tube boiler
* Blast pipe: using exhaust steam to force a draft through the fire and boiler.
Less vertical, more nearly horizontal cylinders: better ride of the loco on the track.
* Pistons directly connected to driving wheels (no gearing used).
Firebox separate from boiler. I.e. water jacket round firebox was primary heating place; firetubes in boiler did additional heating.

Hake survey

Hake's survey is convincing without thinking about effect size statistically. Why? The overwhelming preponderance? ....

Meta-analysis

Here is an example of trying to influence practice (and policy) using effect sizes reported from meta-analyses.

https://educationendowmentfoundation.org.uk/evidence-summaries/teaching-learning-toolkit/

Misc.

Odds ratio

I need to understand this; to understand why wikiP says it is a measure of ES, and integrate it in this page.

Effect size for non-parametric stats

https://www.researchgate.net/post/How_can_I_calculate_the_effect_sizes_of_small_samples_20_for_non-parametric_tests_Wilcoxon_and_Mann-Whitney_U
Conroy, R.M. (2012) "What hypotheses do 'nonparametric' two-group tests actually test?" The Stata Journal vol.12 no.2 pp.182-190 http://www.stata-journal.com/article.html?article=st0253
http://imaging.mrc-cbu.cam.ac.uk/statswiki/FAQ/nonpz This suggests an analogue of Cohen's d for Mann-Whitney: r = z / sqrt(N), where N is the total number of participants and z is the statistic from the Mann-Whitney test.
Newcombe RG (2006) Confidence-intervals for an effect size measure based on the Mann-Whitney statistic. Part 1:General issues and tail-area-based methods. Statistics in Medicine vol.25 pp.543-557 doi:10.1002/sim.2323
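A rough sketch of the r = z / sqrt(N) suggestion above, written from scratch rather than taken from any of the sources listed. It assumes the usual normal approximation for the Mann-Whitney U statistic, with no correction for ties and no continuity correction, so treat it as illustrative only, especially for very small samples. The function name and the example data are made up.

```python
from math import sqrt


def mann_whitney_r(group_a, group_b) -> float:
    """Effect size r = z / sqrt(N) for a Mann-Whitney comparison.

    Positive values mean group_a tends to score higher than group_b.
    """
    n_a, n_b = len(group_a), len(group_b)
    # U: number of (a, b) pairs where a beats b; a tie counts as half.
    u = sum((a > b) + 0.5 * (a == b) for a in group_a for b in group_b)
    mean_u = n_a * n_b / 2.0
    sd_u = sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)  # no tie correction
    z = (u - mean_u) / sd_u
    return z / sqrt(n_a + n_b)


# Made-up scores for illustration only.
treated = [12, 15, 14, 18, 20, 17]
control = [11, 13, 10, 16, 12, 9]
print(f"r = {mann_whitney_r(treated, control):.2f}")
```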

Notes to be addressed

Coe: It's the Effect Size, Stupid

http://daniellakens.blogspot.co.uk/2015/01/always-use-welchs-t-test-instead-of.html

References

Aveyard et al. (2016) "Screening and brief intervention for obesity in primary care: a parallel, two-arm, randomised trial" The Lancet doi:10.1016/S0140-6736(16)31893-1

Bakeman, R. (2005) "Recommended effect size statistics for repeated measures designs" Behavior Research Methods vol.37 no.3 pp.379-384 doi:10.3758/BF03192707; [criticising Cohen (1988) pp.413-414. ]

Bloom, B.S. (1984) "The 2 Sigma Problem: The Search for Methods of Group Instruction as Effective as One-to-One Tutoring" Educational Researcher vol.13 no.6 (Jun. - Jul., 1984) pp.4-16 www.jstor.org/stable/1175554

Cohen, J. (1988) Statistical power analysis for the behavioral sciences (2nd ed.) (Hillsdale, NJ: Erlbaum)

Cohen,J. (1992) "A power primer" Psychological Bulletin Vol.112. No.1 pp.155-159 doi:10.1037/0033-2909.112.1.155

Crouch, C.H. and Mazur, E. (2001), "Peer Instruction: Ten years of experience and results", American Journal of Physics, vol.69, no.9 pp.970-977 doi: 10.1119/1.1374249 Also available at https://www.usna.edu/Users/physics/rwilson/_files/documents/peer.pdf http://mazur-www.harvard.edu:16080/publications/Pub_263.pdf

Field,Andy (2013) Discovering Statistics using IBM SPSS 4th edition p.267

Hake,R.R. (1998) "Interactive-engagement versus traditional methods: A six-thousand-student survey of mechanics test data for introductory physics courses" Am.J.Physics vol.66 no.1 pp.64-74 PDF copy

Hattie, John A.C. (2009) Visible learning: a synthesis of over 800 meta-analyses relating to achievement (London: Routledge) GU lib record=b2650342

Myburgh,S.J., (2016) Critique of Hattie

Perkins,K.K. and Wieman,C.E. (2005) "The Surprising Impact of Seat Location on Student Performance" The Physics Teacher vol.43 January pp.30-33 doi:10.1119/1.1845987

Sievertsen et al. (2016) "Cognitive fatigue influences students' performance on standardized tests" PNAS (Proc. National Academy of Sciences of the USA) doi:10.1073/pnas.1516947113

Beyond effect sizes

There is a new argument emerging that effect sizes do not solve the most basic problems with significance testing, which (so it goes) are:

Picking a p-value of 0.05 is arbitrary and just a convention that doesn't suit all situations. The same is just as true of effect size categories of "medium" etc., as Cohen admitted.
It doesn't solve the issue that experiments should be able to tell us when there is NO effect as well as when there is one i.e. they should be able to quantify the certainty derived from negative as well as positive evidence.

It was Fisher, working on stats for plant genetics, who introduced the use of p-values and conventions. At the same time Jeffreys, working on stats for geological samples, introduced a different approach as expounded in the refs below.

Zoltán Dienes (2008) Understanding Psychology as a Science: An Introduction to Scientific and Statistical Inference (Palgrave Macmillan: London) GU lib record=b2669949
