14 April 2019 ............... Length about 4,000 words (32,000 bytes).
(Document started on 18 Mar 2016.)
This is a WWW document maintained by
Steve Draper, installed at http://www.psy.gla.ac.uk/~steve/apr/apr.html.
You may copy it.
How to refer to it.
Web site logical path:
[Niall Barr's software]
Assessment by pairwise ranking
Assessment by pairwise ranking (APR), also referred to by various other terms
(e.g. ACJ), has emerged from the school sector as a radical departure
in assessment, made feasible only recently by technology.
In APR, instead of reading each student script once and deciding its mark,
markers see pairs of scripts and decide which should rank above the other on a
single complex criterion. Software assembles these pairwise judgements
into a quantitative interval scale (based on Thurstone's "law of comparative
judgement"). Finally, if used for assessment rather than only for ranking,
grade boundaries are superimposed on the resulting rank order.
Controlled studies using professional markers employed by school
exam boards have shown that marking in this way gives much higher (NOT lower)
reliability, and that for large numbers of scripts, the total time taken is
less. The statistics can directly identify when sufficient consensus has
been reached; which scripts generate most disagreement (send them for a second
and third opinion); and which markers agree least with other markers.
Originally designed for higher reliability (repeatability, and so fairness)
of marks and reduced costs, it can also collect feedback comments.
The most interesting underlying issue is that APR is in complete contrast to
the currently dominant, default assessment approach of breaking a judgement
down into separate judgements against multiple explicit criteria, which at
least has the virtue of supporting usefully explicit and
diagnostic feedback to learners. Instead, APR uses a single complex criterion
for marking. However in many ways, real academic values have a large implicit
aspect; and furthermore, are holistic instead of being always and simply
reductionist. APR is particularly appropriate for portfolios of work, and for
work where different students may choose to submit in different media (e.g.
printed, web pages, audio tapes).
It should also be noted that APR seems to be, just as Thurstone argued, more
natural psychologically. Thus it may be of use to adopt as a method even
without any computational support; or when the software simply presents the
cases to be compared without optimising how many comparisons are done.
- Software implementers would probably be well advised to separate the
screen-presentation software module from the statistics and optimisation modules.
- Educationalists should actively and persistently consider whether APR
should be adopted for assessment and evaluation in many contexts, and not just
to reduce marking workload or improve reliability on standard assessment types.
Certainly some academics who have read about it, now do paper marking in this
way: using pairwise comparisons to sort the whole set into rank order, and
then deciding where to put the grade boundaries.
Gonsalvez et al. (2013), in another context, report a method for supervisors
to give summative ratings of competence to students on professional field
placements, that uses a standard set of (4) vignettes (for each of 9 domains).
The supervisor selects the vignette that is closest to the student's
performance on this aspect.
Their argument and evidence is that this leads to more consistency across
raters, and less bias of kinds where supervisors seem prone to rate a student
overall and then attribute that mark to that student across all domains.
This implies that a disadvantage of numerical scores is that they lead
psychologically to LESS precision and discrimination than un-numbered vignettes.
The Thurstone foundation for APR seems also to have a close link to David
Nicol's recent studies of what goes on when students perform (reciprocal) peer
critiquing. He finds that when a student generates a judgement or critique of
another student's work, having just performed the same task themselves
(e.g. their own version of a piece of coursework), they absolutely cannot
prevent themselves making an inner comparison of their own work with the other
student's (a paired comparison); and that they generate a lot of useful
thoughts about their own work from that, even when neither asked nor required to do so.
Links to Niall's software
- Question Mark
- Joe's homegrown lash-up
- MyCampus: electronic submission
USPs: my list of the distinct features that make APR / ACJ important
- This method of marking "scales" as the number of students goes up, i.e.
the time and effort required of human markers reduces per script.
- Working versions of the software have been built, tested, and used;
and by more than one person / organisation.
(Also constructed and used for conference talk refereeing at GU.)
- A major experiment has been done and published, using professional markers;
supporting the key claims (Pollitt, 2012). [Give some numbers]
This paper additionally reports an important qualitative datum: that the
markers were highly sceptical (they did the experiment for the money, at standard
professional rates for marking) but came to see it as better as well as faster
than their traditional way of marking.
- The method has a compelling psychological naturalness.
(Not surprising since it derives from an old psychological theory.)
- Can easily mark cross-media (where different students submit in different media).
- Can easily mark multi-media (where each student uses several media e.g.
pictures and text). E.g. in portfolios.
- Can easily be used for/with unusual, subjective, and implicit marking
criteria. E.g. quality of musical performance, competence in professional
medical practice; or giving a talk to be judged on the extent to which it
"sounds like a professional psychologist".
- Can be used with one complex criterion, OR re-marked on each of several separate criteria.
- Can be used by matching against vignettes (carefully created standard
examples that new work is matched against).
[Gonsalvez et al. 2013] Gonsalvez also shows that APR can be used for
competence assessment of practical, field activities (e.g. in medicine,
veterinary training, ...); and not only for marking to separate students into a
ranking of how good each is.
But this is only a novel extension of something likely to be used in many
cases: seeding the new scripts to be assessed with old scripts selected to
stand on grade boundaries, so that the ranking produced by the core APR
procedure can be translated into grades.
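The seeding idea can be sketched in a few lines: walk down the final ranking, and award each new script the grade attached to the next boundary anchor below it. A hypothetical illustration (function and data names are mine, not from any real system):

```python
def grades_from_anchors(ranking, boundaries, default_grade='F'):
    """Translate a rank order into grades via seeded boundary scripts.

    ranking: scripts best-first, with anchor scripts mixed in.
    boundaries: {anchor_script: grade awarded to everything ranked above it},
    e.g. the old script sitting on the A/B boundary maps to 'A'.
    Scripts below the lowest anchor get default_grade.
    """
    grades, pending = {}, []
    for script in ranking:
        if script in boundaries:
            for s in pending:            # everything above this boundary
                grades[s] = boundaries[script]
            pending = []
        else:
            pending.append(script)
    for s in pending:                    # below the lowest boundary
        grades[s] = default_grade
    return grades
```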
- Can be used with a set of markers, to get them to converge as far
as possible; or to do the job as fast as possible when different markers
contribute different amounts at different times (as may often be the case with
refereeing papers submitted to conferences).
- Can be used for judging by teachers or by peers.
- Can be used to see which markers deviate most from the other markers.
- Can be used to see which scripts attract the least consistent ratings.
=> I.e. to give a detailed, multi-faceted report on "reliability".
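A minimal version of such a reliability report could be computed directly from the raw judgements and the final ranking. The data shapes and names below are assumptions for illustration:

```python
def reliability_report(ranking, judgements):
    """Report deviant markers and inconsistently-judged scripts.

    ranking: scripts, best first.
    judgements: list of (marker, winner, loser) triples.
    Returns (marker_misfit, script_split): per-marker fraction of judgements
    against the final ranking, and per-script fraction of minority votes.
    """
    pos = {s: i for i, s in enumerate(ranking)}
    marker_total, marker_against = {}, {}
    script_votes = {}  # script -> (agreeing, disagreeing) judgement counts
    for marker, w, l in judgements:
        marker_total[marker] = marker_total.get(marker, 0) + 1
        agrees = pos[w] < pos[l]     # lower index = ranked higher
        if not agrees:
            marker_against[marker] = marker_against.get(marker, 0) + 1
        for s in (w, l):
            a, d = script_votes.get(s, (0, 0))
            script_votes[s] = (a + agrees, d + (not agrees))
    marker_misfit = {m: marker_against.get(m, 0) / marker_total[m]
                     for m in marker_total}
    script_split = {s: min(a, d) / (a + d)
                    for s, (a, d) in script_votes.items()}
    return marker_misfit, script_split
```

High-misfit markers are candidates for moderation; high-split scripts are the ones to send for a second and third opinion.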
Separate intellectual ideas
- Thurstone: a better model of the psychological process of judgement
- Nicol: involuntary comparative judgements.
All measurement is relative in the end.
- Real task analysis of what human markers are doing in detail when marking
- Bramley & Oates (2010)
- Gonsalvez, C. J., Bushnell, J., Blackman, R., Deane, F., Bliokas, V.,
Nicholson-Perry, K., . . . Knight, R. (2013)
"Assessment of Psychology Competencies in Field Placements: Standardized
Vignettes Reduce Rater Bias"
Training and Education in Professional Psychology
vol.7 no.2 pp.99-111 doi:10.1037/a0031617
- Plimmer, B. & Apperley, M.D. (2007) "Making paperless work"
Proceedings of the 7th ACM SIGCHI New Zealand chapter's international
conference on Computer-human interaction: design centered HCI 2007, pp.1-8
- Plimmer, B. & Mason, P. (2006) "A pen-based paperless environment for
annotating and marking student assignments" Proc. 7th Australasian User
Interface Conference, CRPIT Press, pp.37-44
- Pollitt,A. (2004) "Let's stop marking exams" presented at IAEA conference.
Also available at:
- Pollitt,A. (2012) "The method of Adaptive Comparative Judgement"
Assessment in Education: Principles, Policy & Practice
Vol.19 no.3 pp.281-300
Correction to an equation (published vol.19, no.3, p.387).
Thurstone, L.L. 1927a. "A law of comparative judgment"
Psychological Review vol.34 no.4 pp.273-286
[No doi, but online from GU library]
Reprinted in L.L.Thurstone (1959)
The measurement of values Chapter 3
(Chicago, IL: University of Chicago Press)
[This is a general and technical statement of "A law of comparative
judgment"; with some maths.]
Thurstone, L.L. 1927b. "Psychophysical analysis"
The American Journal of Psychology vol.38 no.3 pp.368-89
[This discusses at some length its application to perceptual judgements.]
Thurstone, L.L. 1931. "Measurement of change in social attitude"
Journal of Social Psychology vol.2 no.2 pp.230-5
[This briefly describes its use in measuring social attitudes.]
More references (received from Paul Anderson)
- Ajjawi, R., & Bearman, M. (2018).
"Problematising standards: representation or performance?"
ch.4 pp.57-66 in
David Boud, Rola Ajjawi, Phillip Dawson, Joanna Tai (eds.)
Developing Evaluative Judgement in Higher Education
- Barrada, J. R., Olea, J., Ponsoda, V., & Abad, F. J. (2010)
"A method for the
comparison of item selection rules in computerized adaptive testing"
Applied Psychological Measurement 34(6), 438-452.
- Bloxham, S., den-Outer, B., Hudson, J., & Price, M. (2016)
"Let's stop the pretence of consistent marking:
exploring the multiple limitations of assessment criteria"
Assessment & Evaluation in Higher Education, 41(3), 466-481.
- Bradley, R. A., & Terry, M. E. (1952).
"Rank analysis of incomplete block designs: I. The method of paired comparisons"
Biometrika 39(3/4), 324-345.
- Bramley, T., & Wheadon, C. (2015, November).
"The reliability of Adaptive Comparative Judgement"
Paper presented at the AEA-Europe annual conference (Vol. 4, p. 7).
- Brinker, C., Mencía, E. L., & Fürnkranz, J. (2014, December)
"Graded multilabel classification by pairwise comparisons"
In 2014 IEEE International Conference on Data Mining
pp. 731-736 IEEE.
- Buse, R. P., & Weimer, W. R. (2008, July)
"A metric for software readability"
In Proceedings of the 2008 international symposium on Software testing and
analysis (pp. 121-130). ACM.
- Hardy, J., Galloway, R., Rhind, S., McBride, K., Hughes, K., & Donnelly, R.
"Ask, answer, assess" Higher Education Academy.
- Kimbell, R. (2008). "E-assessment in project e-scape"
Design and Technology Education: An International Journal, 12(2).
- McKenzie, Ross (2018) "Progressive Adaptive Comparative Judgement"
- McKenzie, Ross
"A python implementation of Progressive Adaptive Comparative Judgement"
- Negahban, S., Oh, S., & Shah, D. (2012).
"Iterative ranking from pair-wise comparisons"
In Advances in neural information processing systems (pp. 2474-2482).
- Relf, P. A. (2004).
Achieving software quality through source code readability
Quality Contract Manufacturing LLC.
- Wauthier, F., Jordan, M., & Jojic, N. (2013, February).
"Efficient ranking from pairwise comparisons"
In International Conference on Machine Learning (pp.109-117).
Adaptive Comparative Judgement (ACJ) is the term used by Pollitt in the most
important publication so far. However it describes the software, not the human
process nor the psychology of it. The software does the adaptation; the human
judges do not. The humans just do pairwise comparisons or (more exactly) pairwise rankings.
The theory, following Thurstone, assumes that there is an implicit,
psychologically real, scale in the minds of the humans, which is not directly
accessible by them (through introspection or reasoning), but reveals itself as a
consistent pattern in their judgements.
Furthermore: that this is true of complex judgements, and that these are not
helped by attempting to break them down into multiple component criteria
which must then be combined again to reach a mark; almost always, by a
simplistic arithmetic operation such as adding marks, which generally does not
reproduce the academic value judgements actually made by the experts.
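For concreteness, the Thurstone assumption sketched above can be written down directly: each script has a latent value on the implicit scale, and the probability that one is preferred to the other depends only on the difference of those values. A small sketch of the standard Case V formulation (unit discriminal dispersion assumed):

```python
import math

def p_prefer(v_a, v_b):
    """Thurstone Case V: P(A preferred to B) = Phi((v_a - v_b) / sqrt(2)),
    where Phi is the standard normal CDF and v are latent scale values.
    Each judgement compares two noisy readings of the latent values."""
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))); here x = (v_a - v_b) / sqrt(2),
    # so the erf argument simplifies to (v_a - v_b) / 2.
    return 0.5 * (1.0 + math.erf((v_a - v_b) / 2.0))
```

The consistent pattern in many such probabilistic judgements is what lets the statistics recover the latent values, even though the judges cannot report them directly.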
TR = Thurstone ranking [Grove-Stephenson]
TP = Thurstone pairs [Mhairi]
TS = Thurstone scaling [a web page on psych methods]
LCJ = Law of comparative judgement (Thurstone's "law")
CJ = Comparative Judgement; "Thurstone's method of CJ". [Pollitt]
TPCJ = Thurstone Paired Comparative Judgements; & "Thurstone's method of CJ".
DCJ = Direct Comparative Judgement [Pollitt]
PC = Pairwise (or Paired) Comparison [Bramley]
ROM = Rank Ordering Method [Bramley]
PCM = Pairwise (or Paired) Comparison Methods [Bramley]
*APR = Assessment by Pairwise Ranking
ADCJ = Assessment by Direct Comparative Judgement [Pollitt]
PRS = Pairwise Ranking Scales
PRTS = Pairwise Ranking Thurstone Scales
PCR = Pairwise Comparative Ranking [my term, but the best: avoids the abbreviation PC]
PCRS = Pairwise Comparative Ranking Scales [my term]
ACJ = Adaptive Comparative Judgement [common]
Currently preferred terms:
APR for general overall process; or
ACJ for the software-optimised cost-saving version. CJs for individual judgements.
N.B. comparative & judgement are redundant [not quite because judgement can be absolute]
comparative & ranking are redundant [TRUE]
Strictly, Thurstone scaling produces more than a ranking: an interval scale.
A vision. Or a wish, anyway.
I'd like to see a UI (user interface) allowing use of the APR process without
displaying the objects on screen (e.g. handwritten exam scripts): just have a
unique ID on a post-it on each script, and the software selecting pairs
to compare, receiving the judgements, receiving the formative comments if any.
While many things can be usefully digitised and presented on a computer, this
will probably never be true of all the objects we may need to judge.
Here's another thought I'd like to share.
Pollitt's work made a deep impression on me, in making me think about the real
psychological marking process; and how better to work with it, and support it.
But I learned another big lesson, which might perhaps be combined with APR,
from Beryl Plimmer (ref.s above):
Plimmer worked in HCI (Human Computer Interaction) and did a real, proper task
analysis of what is involved in marking a heap of first year programming
assignments. Firstly: the student submissions were zip files, with many parts
(code, output of the test suite, documentation ....). This is another case
where current UIs to ACJ engines just won't cut it. Plimmer's software opened
all the zipped files at once in separate windows (and probably had careful
window placement).
Secondly, she recognised that the marker had to input multiple DIFFERENT
things in response, in DIFFERENT documents (and formats), and for DIFFERENT
audiences: the mark to admin people; feedback to the student; markups on the
code itself, ..... And she used a single pen input device (and then character
recognition to turn it into digital stuff) to save the user switching from
mouse to keyboard constantly.
Thirdly, this made me reflect on my own marking (of piles of essays in
psychology) and why I never felt software would help because I use a big
table (enough for at least 4 but preferably 6 people to sit at) exactly
because there is so much stuff that needs to be visible/writeable at once, and
computer screens are so pathetically small compared to a table.
(But Plimmer shows that you CAN use a screen but that special code for opening
and managing the windows automatically makes a considerable difference.)
In fact, in my own case I generally have 4 "outputs" on different bits of paper:
- Comments to return to the student
- My private comments to myself, e.g. to use when discussing my marks with a second marker
- Mark sheet to the Admin staff
- A sheet with any ideas and refs the student used that I want to follow up
for myself, because sometimes students have things to teach me, and/or trigger
off new thoughts in me that, as a subject specialist, I don't want to forget.
Exams, especially in final year options, often cause learning in the teacher.
This is not only illuminating but shows that the ACJ UIs up to now could with
profit be seriously expanded. It has been shown that ACJ can usefully include
the collection and use of formative feedback comments; but that doesn't tackle
the real, full needs of marking in general.
When assessing complex work (portfolios, sculptures, music ...),
it is likely that the assessor will want to make extensive notes
to self; and in later rounds of comparisons, re-view not the original work,
but their notes on it.
So: I think there are really several distinct directions in which Pollitt's
work has contributed, and could be taken further.
A) Algorithms that cut down the work of marking by making economies of scale
B) More reliable (accurate, repeatable) marking.
C) Psych. insight into how assessment really works in the human mind; and hence
how to be more realistic about it, and more aware of what it does well vs. badly.
D) Consequent educational ideas about how to understand and improve
assessment processes with both lower costs and higher benefits.
E) Further educational applications. E.g. if software was good enough and
cheap enough, we could run exercises with students as the markers to:
- Train them to make the same judgements as experts, where appropriate and possible.
- Show them how judges do and don't differ from each other.
- Possibly, show students how their judgements are much better (though less
confident) than they think. ....
This is an example of how learning often involves becoming expert,
which in turn involves not just learning concepts and facts,
and how to reason about them;
but also learning new perceptual skills, and new execution skills (fast
sentence composition, ....).
F) We now know quite a lot about the software requirements that wasn't known before.
- The modularisation / architecture: e.g. separation of object presentation;
choice of which pair to look at; and the other aspects of the UI design.
- Flexibility in display, and whether to display it at all.
- Variations in the algorithm; how to deal with worst-case ordering of
judgements, e.g. a rogue marker, an unusual object, or one wrong judgement.
- The need for multiple outputs from the marker, at least in some cases; and
how to make a much more efficient UI for this.
- How to handle multiple inputs from one learner product, as in Plimmer's case.
A paper on requirements might really be a contribution ....
With hindsight, I would now (March 2019) recommend that software for APR be
organised into 4 well-separated major modules.
- The display part of the user interface.
The "scripts" may be displayed on screen, or not. This module must be ready for
dealing with a display of nothing, or 1 document (the canonical case),
or many documents. The latter is not an exotic case.
Plimmer (2006) dealt with student submissions of 5 related
documents, just for level 1 computer programming.
Furthermore, they should be displayed in a carefully designed window with
several panes of different custom sizes.
Displaying "nothing" i.e. having an input-only interface would be useful
in going round a sculpture park, assessing the exhibits which are much larger
than the computer.
However, also common in such cases is displaying a token reminder, e.g. a
thumbnail, to remind the user which case they are judging / comparing.
A common variant of this will be when the user observes each case separately
and makes notes on them; then these notes should be displayed during comparison
phase, to remind the user. So the display module needs to be able to accept
documents from the judges to display here (not just from the authors or
students being assessed).
An example could be the sculpture park; but also (for example) judging which
is the best PhD of the year, where reading / skimming each PhD would be done in
an earlier phase, followed by multiple comparisons and judgements.
- The user input part of the user interface.
Another big insight from Plimmer is that the output of marking is OFTEN
multiple output documents (marks vs comments in many cases; but also notes for
later use in moderation discussions, or private notes on ideas to follow up,
stimulated by a good exam answer).
- Statistics calculations engine
Calculation of best estimates of ordering and distances; and calculating
orderings of objects, of markers, ....
Follow a ref. to a related literature in psychophysics.
The "staircase" procedure there is also about the "smartest" way to zero in on
small differences between measurements.
Lamming (papers in HEA subject area? ....)
Kingdom and Prins' book "Psychophysics"; ch. on "Adaptive methods".
- Decision-making and Optimisation
(about what pairs should be offered for comparison at each point)
This module chooses which pair to present next for judgement, given what is
known at this point in the judging process.
This is the key function for reducing effort per script to be assessed.
This needs broader functionality than early software, to allow differing
modes, e.g. to cope with unknown availability of markers, or how to use
a new marker coming on board later than others, etc.
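A minimal version of the adaptive choice, far simpler than a production engine: among pairs not yet judged, present the two scripts whose current scale estimates are closest, since near-ties are the most informative comparisons. The names below are illustrative assumptions:

```python
from itertools import combinations

def next_pair(scores, judged):
    """Choose the next pair to present for judgement.

    scores: {script: current scale estimate}.
    judged: set of frozensets of already-compared pairs.
    Returns the unjudged pair with the smallest estimated gap,
    or None when every pair has been judged.
    """
    best, best_gap = None, float('inf')
    for a, b in combinations(sorted(scores), 2):
        if frozenset((a, b)) in judged:
            continue
        gap = abs(scores[a] - scores[b])
        if gap < best_gap:
            best, best_gap = (a, b), gap
    return best
```

A real engine would also weigh how often each script has already been seen, and balance load across markers; this sketch only shows where that decision logic sits relative to the display and statistics modules.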
Niall Barr's software implementing APR / ACJ
[Should this be moved somewhere else?]
Link to my notes on (using) Niall Barr's software
implementing APR / ACJ
Category, ordinal, interval, ratio-scale