21 June 2019 ............... Length about 5,000 words (39,000 bytes).
(Document started on 18 Mar 2016.)
This is a WWW document maintained by
Steve Draper, installed at http://www.psy.gla.ac.uk/~steve/apr/apr.html.
You may copy it.
Assessment by pairwise ranking
Assessment by pairwise ranking (APR), also referred to by various other terms
e.g. ACJ, has lately emerged from the school sector as a radical departure
in assessment, only recently made feasible by technology.
In APR, instead of reading each student script once and deciding its mark,
markers see pairs of scripts and decide which should rank above the other on a
single complex criterion. Software assembles these pairwise judgements
into an ordering (an ordinal scale). However by applying
Thurstone's "law of comparative judgement" the software further calculates a
quantitative interval scale -- allowing it to calculate what further
comparisons will yield the most or least additional information, and so allow
optimisations that for large numbers (20 scripts is a very rough threshold)
reduce the total marking work.
Finally, if used for assessment rather than only for ranking, grade boundaries
are superimposed on the rank order.
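As a rough illustration of how pairwise judgements can be turned into an interval scale, here is a minimal sketch. It does not reproduce any particular ACJ package: it fits the Bradley-Terry model (Bradley & Terry 1952, in the references below), a logistic relative of Thurstone's model, using a standard iterative update. The function name `fit_scale`, the `prior` smoothing term (added so that a script which wins every comparison still gets a finite value) and the example data are all assumptions made for this sketch.

```python
import math
from collections import Counter

def fit_scale(judgements, iterations=200, prior=0.5):
    """Estimate a scale value (in logits, centred on zero) for each script.

    judgements: list of (winner_id, loser_id) tuples, winner != loser.
    prior: assumed smoothing; each script gets `prior` virtual wins and
           losses against a phantom script of fixed strength 1.
    """
    wins = Counter()
    pair_counts = Counter()          # how often each unordered pair was judged
    scripts = set()
    for winner, loser in judgements:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        scripts.update((winner, loser))

    strength = {s: 1.0 for s in scripts}
    for _ in range(iterations):
        updated = {}
        for s in scripts:
            denom = 2 * prior / (strength[s] + 1.0)   # phantom comparisons
            for pair, n in pair_counts.items():
                if s in pair:
                    (other,) = pair - {s}
                    denom += n / (strength[s] + strength[other])
            updated[s] = (wins[s] + prior) / denom
        strength = updated

    logits = {s: math.log(p) for s, p in strength.items()}
    mean = sum(logits.values()) / len(logits)
    return {s: v - mean for s, v in logits.items()}

# Made-up judgements: A usually beats B, B usually beats C, A always beats C.
judgements = ([("A", "B")] * 3 + [("B", "A")] +
              [("B", "C")] * 3 + [("C", "B")] +
              [("A", "C")] * 2)
scale = fit_scale(judgements)
ranking = sorted(scale, key=scale.get, reverse=True)
```

The ordering comes from sorting the fitted values; the gaps between the values are what the ordering alone does not give, and are what an adaptive system can use to decide which further comparisons are worth making.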
Controlled studies using professional markers employed by school
exam boards have shown that marking in this way gives much higher (NOT lower)
reliability, and that for large numbers of scripts, the total time taken is
less. The statistics can directly identify when sufficient consensus has
been reached; which scripts generate most disagreement (send them for a second
and third opinion); and which markers agree least with other markers.
Originally designed for higher reliability (repeatability, and so fairness)
of marks and reduced costs, it can also collect feedback comments.
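A very simple version of such a report can be sketched as follows. This is a stand-in for the proper misfit statistics, not the published method: it just counts, for each marker and each script, the fraction of judgements that contradict the consensus scale. The function name `disagreement_report`, the data layout and the example values are assumptions made for this sketch.

```python
from collections import defaultdict

def disagreement_report(judgements, scale):
    """Fraction of judgements contradicting the consensus scale.

    judgements: list of (marker_id, winner_id, loser_id)
    scale:      {script_id: consensus value}, e.g. from a fitted model
    Returns (by_marker, by_script), each a dict of rates between 0 and 1.
    """
    marker_counts = defaultdict(lambda: [0, 0])   # [contradicting, total]
    script_counts = defaultdict(lambda: [0, 0])
    for marker, winner, loser in judgements:
        contradicts = scale[winner] < scale[loser]
        for counts in (marker_counts[marker],
                       script_counts[winner],
                       script_counts[loser]):
            counts[0] += contradicts
            counts[1] += 1

    def to_rates(table):
        return {key: bad / total for key, (bad, total) in table.items()}

    return to_rates(marker_counts), to_rates(script_counts)

# Made-up consensus values and judgements from three markers.
scale = {"A": 1.2, "B": 0.1, "C": -1.3}
judgements = [
    ("m1", "A", "B"), ("m1", "B", "C"), ("m1", "A", "C"),
    ("m2", "A", "B"), ("m2", "C", "B"),
    ("m3", "B", "A"), ("m3", "C", "A"),
]
by_marker, by_script = disagreement_report(judgements, scale)
least_consistent_marker = max(by_marker, key=by_marker.get)
```

Scripts with a high rate are the candidates for a second and third opinion; markers with a high rate are the ones who agree least with their colleagues.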
The most interesting underlying issue is that APR stands in complete contrast
to the currently dominant, default approach to assessment: breaking a
judgement down into separate judgements against multiple explicit criteria,
which at least has the virtue of supporting usefully explicit and diagnostic
feedback to learners. Instead, APR uses a single complex criterion
for marking. However in many ways, real academic values have a large implicit
aspect; and furthermore, are holistic instead of being always and simply
reductionist. APR is particularly appropriate for portfolios of work, and for
work where different students may choose to submit in different media (e.g.
printed, web pages, audio tapes).
It should also be noted that APR seems to be, just as Thurstone argued, more
natural psychologically. Thus it may be of use to adopt as a method even
without any computational support; or when the software simply presents the
cases to be compared without optimising how many comparisons are done.
- Software implementers would probably be well advised if they
separated the screen-presentation software module from the statistics and
- Educationalists should actively and persistently consider whether APR
should be adopted for assessment and evaluation in many contexts, and not just
to reduce marking workload or improve reliability on standard assessment types.
Certainly some academics who have read about it, now do paper marking in this
way: using pairwise comparisons to sort the whole set into rank order, and
then deciding where to put the grade boundaries.
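Sorting a pile by pairwise comparison is, in computing terms, a comparison sort. The sketch below uses binary insertion so that each new script is compared only against a few already-ranked ones. It assumes a perfectly consistent judge, which real markers are not; `prefer` is a hypothetical callback standing in for the human decision, and the hidden quality numbers exist only to make the example run.

```python
def rank_by_pairwise(scripts, prefer):
    """Return scripts ordered best-first.

    prefer(a, b) must return True if script a should rank above script b.
    Each script is placed by binary search among those already ranked.
    """
    ranked = []
    for script in scripts:
        lo, hi = 0, len(ranked)
        while lo < hi:
            mid = (lo + hi) // 2
            if prefer(script, ranked[mid]):
                hi = mid        # script belongs above ranked[mid]
            else:
                lo = mid + 1    # script belongs below ranked[mid]
        ranked.insert(lo, script)
    return ranked

# Stand-in for the marker: a hidden quality per script, never shown as a mark.
hidden_quality = {"s1": 52, "s2": 71, "s3": 64, "s4": 38, "s5": 80}
asked = []                      # log of every comparison requested

def prefer(a, b):
    asked.append((a, b))
    return hidden_quality[a] > hidden_quality[b]

order = rank_by_pairwise(list(hidden_quality), prefer)
all_pairs = len(hidden_quality) * (len(hidden_quality) - 1) // 2
```

Here `len(asked)` comes out below `all_pairs`, and the saving grows with the size of the pile: all pairs grows as n(n-1)/2, while binary insertion needs roughly n log2(n) comparisons.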
Gonsalvez et al. (2013), in another context, report a method for supervisors
to give summative ratings of competence to students on professional field
placements, that use a standard set of (4) vignettes (for each of 9 domains).
The supervisor selects the vignette that is closest to the student's
performance on this aspect.
Their argument and evidence is that this leads to more consistency across
raters, and less bias of kinds where supervisors seem prone to rate a student
overall and then attribute that mark to that student across all domains.
Implying that a disadvantage of numerical scores is that they lead
psychologically to LESS precision and discrimination than un-numbered
The Thurstone foundation for APR seems also to have a close link to David
Nicol's recent studies of what goes on when students perform (reciprocal) peer
critiquing. He finds that when a student generates a judgement or critique of
another student's work, when they themselves have just performed the same task
e.g. their own version of a piece of coursework, then they absolutely cannot
prevent themselves doing an inner comparison of their own work with the other
student's (a paired comparison); and that they generate a lot of useful
thoughts about their own work from that, even when neither asked nor required to
do so (Nicol 2018).
Links to Niall's software
Related software *
- Question Mark
- Joe's homegrown lash-up
- MyCampus: electronic submission
My list of the 13 distinct features that make APR / ACJ important
- As the number of students goes up, this method of marking "scales"
i.e. the time and effort required of human markers to do the marking reduces
- Working versions of the software have been built, tested, and used;
and by more than one person and in more than one organisation.
(Also constructed and used for conference talk refereeing at Glasgow
- A major experiment has been done and published, using professional markers;
supporting the key claims (Pollitt, 2012). [*
Give some numbers]
This paper additionally reports an important qualitative datum: that the
markers were highly sceptical (did the experiment for the money, at standard
professional rates for marking) but came to see it as better as well as faster
than their traditional way of doing marking.
- The method has a compelling psychological naturalness.
(Not surprising since it derives from an old psychological theory.)
- It can easily be arranged to collect comments at the same time e.g. to
generate formative feedback as well as summative marks.
- It can easily mark cross-media (where different students submit in different
- It can easily mark multi-media (where each student uses several media e.g.
pictures and text). E.g. in portfolios.
- It can easily be used for/with unusual, subjective, and implicit marking
criteria. E.g. quality of musical performance, competence in professional
medical practice; or giving a talk to be judged on the extent to which it
"sounds like a professional psychologist".
- It can be used with one complex criterion OR remarked on each of several
- It can be used by matching against vignettes (carefully created standard
examples that new work is matched against).
[Gonsalvez et al. 2013] Gonsalvez also shows that APR can be used for
competence assessment of practical, field activities (e.g. in medicine,
veterinary training, ...); and not only for marking to separate students into a
ranking of how good each is.
But this is only a novel extension of something likely to be used in many
cases: of seeding the new scripts to be assessed, with old scripts selected to
stand on grade boundaries; so that the ranking produced by the core APR
procedure can be translated into grades.
- It can be used with a set of markers, to get them to converge as
completely as possible; or to do the overall job as fast as
possible when different markers contribute different amounts at different
times (as may often be the case with refereeing papers submitted to
- It can be used for judging by teachers OR by peers.
- It can be used to see which markers deviate most from the other markers.
It can be used to see which scripts attract the least consistent ratings
=> I.e. to give a detailed, multi-faceted report on "reliability".
Separate intellectual ideas
- Thurstone: a better model of the psychological process of judgement
- Nicol: involuntary comparative judgements. (Nicol, 2018)
(All measurement is relative in the end.)
- Plimmer: Real task analysis of the detail of what human markers are doing
- Bramley & Oates (2010)
- Dale, V.H.M., et al, 2019 "Learner experiences of a blended course
incorporating a MOOC on Haskell functional programming"
Accepted for publication in Research in Learning Technology
- Gonsalvez, C. J., Bushnell, J., Blackman, R., Deane, F., Bliokas, V.,
Nicholson-Perry, K., . . . Knight, R. (2013)
"Assessment of Psychology Competencies in Field Placements: Standardized
Vignettes Reduce Rater Bias"
Training and Education in Professional Psychology
vol.7 no.2 pp.99-111 doi:10.1037/a0031617
- Nicol, David (2018) "Unlocking generative feedback through peer reviewing"
ch.5 p.47-59 in V.Grion & A.Serbati (2018)
Assessment of learning or assessment for learning? Towards a culture
of sustainable assessment in higher education
- Plimmer,Beryl & Apperley,M.D. (2007) "Making paperless work"
Proceedings of the 7th ACM SIGCHI New Zealand chapter's international
conference on Computer-human interaction: design centered HCI 2007: pp.1-8
- Plimmer,B. & Mason,P. (2006) "A pen-based paperless environment for
annotating and marking student assignments" Proc. 7th Australasian User Interface
Conference, CRPIT Press pp.37-44
- Pollitt,A. (2004) "Let's stop marking exams" presented at IAEA conference.
Also available at:
- Pollitt,A. (2012) "The method of Adaptive Comparative Judgement"
Assessment in Education: Principles, Policy & Practice
Vol.19 no.3 pp.281-300
Correction to an equation (published vol.19, no.3, p.387 doi:
- Thurstone, L.L. 1927a. "A law of comparative judgment"
Psychological Review vol.34 no.4 pp.273-286
[No doi, but online from GU library]
Reprinted in L.L.Thurstone (1959)
The measurement of values Chapter 3
(Chicago, IL: University of Chicago Press)
[This is a general and technical statement of "A law of comparative
judgment"; with some maths.]
- Thurstone, L.L. 1927b. "Psychophysical analysis"
The American Journal of Psychology vol.38 no.3 pp.368-89
[This discusses at some length its application to perceptual judgements.]
- Thurstone, L.L. 1931. "Measurement of change in social attitude"
Journal of Social Psychology vol.2 no.2 pp.230-5
[This briefly describes its use in measuring social attitudes.]
More references (mostly received from Paul Anderson)
- Ajjawi, R., & Bearman, M. (2018).
"Problematising standards: representation or performance?"
ch.4 pp.57-66 in
David Boud, Rola Ajjawi, Phillip Dawson, Joanna Tai (eds.)
Developing Evaluative Judgement in Higher Education
- Barrada, J. R., Olea, J., Ponsoda, V., & Abad, F. J. (2010)
"A method for the
comparison of item selection rules in computerized adaptive testing"
Applied Psychological Measurement 34(6), 438-452.
- Bloxham, S., den-Outer, B., Hudson, J., & Price, M. (2016)
"Let's stop the pretence of consistent marking:
exploring the multiple limitations of assessment criteria"
Assessment & Evaluation in Higher Education, 41(3), 466-481.
- Bradley, R. A., & Terry, M. E. (1952).
"Rank analysis of incomplete block designs: I. The method of paired comparisons"
Biometrika 39(3/4), 324-345.
- Bramley, T., & Wheadon, C. (2015, November).
"The reliability of Adaptive Comparative Judgement"
In Paper presented at the AEA-Europe annual conference
(Vol. 4, p. 7).
- Brinker, C., Mencía, E. L., & Fürnkranz, J. (2014, December)
"Graded multilabel classification by pairwise comparisons"
In 2014 IEEE International Conference on Data Mining
pp. 731-736 IEEE.
- Buse, R. P., & Weimer, W. R. (2008, July)
"A metric for software readability"
In Proceedings of the 2008 international symposium on Software testing and
analysis (pp. 121-130). ACM.
- Glass,R.L. (2003) "About Education" ch.7 p.181-4
in Facts and Fallacies of Software Engineering
Addison-Wesley, Boston, MA.
- Hardy, J., Galloway, R., Rhind, S., McBride, K., Hughes, K., & Donnelly, R.
"Ask, answer, assess" Higher Education Academy.
- Kimbell, R. (2008). "E-assessment in project e-scape"
Design and Technology Education: An International Journal, 12(2).
- McKenzie, Ross (2018) "Progressive Adaptive Comparative Judgement"
- McKenzie, Ross
"A python implementation of Progressive Adaptive Comparative Judgement"
- Negahban, S., Oh, S., & Shah, D. (2012).
"Iterative ranking from pair-wise comparisons"
In Advances in neural information processing systems (pp. 2474-2482).
- Relf, P. A. (2004).
Achieving software quality through source code readability
Quality Contract Manufacturing LLC.
- Wauthier, F., Jordan, M., & Jojic, N. (2013, February).
"Efficient ranking from pairwise comparisons"
In International Conference on Machine Learning (pp.109-117).
Adaptive Comparative Judgement (ACJ) is the term used by Pollitt in the most
important publication so far. However it describes the software, not the human
process nor the psychology of it. The software does the adaptation, the human
judges do not. The humans just do pairwise comparisons or (more exactly)
The theory, following Thurstone, assumes that there is an implicit,
psychologically real, scale in the minds of the humans, which is not directly
accessible by them (through introspection or reasoning), but reveals itself as a
consistent pattern in their judgements.
Furthermore: that this is true of complex judgements, and that these are not
helped by attempting to break them down into multiple component criteria
which must then be combined again to reach a mark; almost always, by a
simplistic arithmetic operation such as adding marks, which generally does
not reproduce the academic value judgements actually made by the experts
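For reference, the assumption of an implicit scale is usually written as below. This is the standard textbook form of the simplest version of Thurstone's law (his "Case V", where every item has the same spread and the spreads are uncorrelated), not something derived on this page. The symbols are labels chosen here: S_A and S_B are the positions of items A and B on the implicit scale, sigma is the common spread of a single perception, and Phi is the standard normal cumulative distribution function.

```latex
P(A \succ B) \;=\; \Phi\!\left(\frac{S_A - S_B}{\sigma\sqrt{2}}\right)
\qquad\Longleftrightarrow\qquad
S_A - S_B \;=\; \sigma\sqrt{2}\;\Phi^{-1}\!\bigl(P(A \succ B)\bigr)
```

Read left to right, it says how often A will be judged above B given their scale positions; read right to left, it says how the observed proportion of "A above B" judgements gives an estimate of the distance between them.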
TR = Thurstone ranking [Grove-Stephenson]
TP = Thurstone pairs [Mhairi]
TS = Thurstone scaling [a web page on psych methods]
LCJ = Law of comparative judgement (Thurstone's "law")
CJ = Comparative Judgement; "Thurstone's method of CJ". [Pollitt]
TPCJ = Thurstone Paired Comparative Judgements; & "Thurstone's method of CJ".
DCJ = Direct Comparative Judgement [Pollitt]
PC = Pairwise (or Paired) Comparison [Bramley]
ROM = Rank Ordering Method [Bramley]
PCM = Pairwise (or Paired) Comparison Methods [Bramley]
*APR = Assessment by Pairwise Ranking
ADCJ = Assessment by Direct Comparative Judgement [Pollitt]
PRS = Pairwise Ranking Scales
PRTS = Pairwise Ranking Thurstone Scales
PCR = Pairwise Comparative Ranking [my term, but is best, avoids abbrev. PC]
PCRS = Pairwise Comparative Ranking Scales [my term]
ACJ = Adaptive Comparative Judgement [common]
Currently preferred terms:
APR for general overall process; or
ACJ for the software-optimised cost-saving version. CJs for individual judgements.
N.B. comparative & judgement are redundant [not quite because judgement can be absolute]
comparative & ranking are redundant [TRUE]
Strictly, Thurstone scaling produces more than a ranking: an interval scale.
Part 2: Beyond Pollitt's work
A vision. Or a wish, anyway.
I'd like to see a UI (user interface) allowing use of the APR process without
displaying the objects on screen (e.g. handwritten exam scripts): just have a
unique ID on a post-it on each script, and the software selecting pairs
to compare, receiving the judgements, receiving the formative comments if any.
While many things can be usefully digitised and presented on a computer, this
will probably never be true of all the objects we may need to judge.
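To make the idea concrete, here is a minimal sketch of such an input-only interface: nothing is displayed except the two IDs, and the judge types the ID of the better script plus an optional comment. The function name `judge_by_id`, the prompts and the record format are all assumptions made for illustration; the list of pairs is taken as given and would come from whatever pair-selection code is in use.

```python
def judge_by_id(pairs, ask=input):
    """Collect judgements for physical scripts identified only by ID labels.

    pairs: iterable of (id_a, id_b) to present, chosen elsewhere.
    ask:   function that shows a prompt and returns the typed reply
           (defaults to the keyboard; replaceable for testing).
    Returns a list of dicts with keys winner, loser, comment.
    """
    records = []
    for id_a, id_b in pairs:
        choice = ""
        while choice not in (id_a, id_b):      # re-ask on a mistyped ID
            choice = ask(f"Compare scripts {id_a} and {id_b}. "
                         "Which ranks higher? ").strip()
        comment = ask("Optional comment (Enter to skip): ").strip()
        loser = id_b if choice == id_a else id_a
        records.append({"winner": choice, "loser": loser, "comment": comment})
    return records

# Scripted replies stand in for a judge at the keyboard.
replies = iter(["17", "good structure", "99", "42", ""])
records = judge_by_id([("17", "42"), ("42", "08")],
                      ask=lambda prompt: next(replies))
```

Passing the `ask` function in, rather than calling `input` directly, is one small instance of keeping the presentation layer separable from everything else.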
Here's another thought I'd like to share.
Pollitt's work made a deep impression on me, in making me think about the real
psychological marking process; and how better to work with it, and support it.
But I learned another big lesson, which might perhaps be combined with APR,
from Beryl Plimmer (ref.s above):
Plimmer worked in HCI (Human Computer Interaction) and did a real, proper task
analysis of what is involved in marking a heap of first year programming
assignments. Firstly: the student submissions were zip files, with many parts
(code, output of the test suite, documentation ....). This is another case
where current user interfaces (UIs) to ACJ engines just won't cut it.
Plimmer's software opened all the zipped files at once in separate windows
(and probably had careful window/pane management).
Secondly, she recognised that the marker had to input multiple DIFFERENT
responses, in DIFFERENT documents (and formats), and for DIFFERENT
audiences: the mark to admin people; feedback to the student; markups on the
code itself, ..... And she used a single pen input device (and then character
recognition to turn it into digital stuff) to save the user switching from
mouse to keyboard constantly.
Thirdly, this made me reflect on my own marking (of piles of essays in
psychology) and why I never felt software would help because I use a big
table (enough for at least four but preferably six people to sit at) exactly
because there is so much stuff that needs to be visible/writeable at once, and
computer screens are so pathetically small compared to a table.
(But Plimmer shows that you CAN use a screen but that special code for opening
and managing the windows automatically makes a considerable difference.)
In fact, in my own case I generally have four "outputs" on different bits of
- Comments to return to the student
- My private comments to myself e.g. to use when discussing my marks with a
- Mark sheet to the Admin staff
- A sheet with any ideas and refs the student used that I want to follow up
for myself, because sometimes students have things to teach me, and/or trigger
off new thoughts in me that, as a subject specialist, I don't want to forget.
Exams, especially in final year options, often cause learning in the teacher.
Related to Plimmer's insights into improving software to improve the marking
task is how some individuals have created substantial software improvements.
Plimmer's paper gives a published description of her carefully designed system
for marking year 1 programming assignments in her dept. Joe Maguire's
practice covers a different kind of course and marking, and is implemented
quite differently: it is built out of ready-to-hand software, adapted into a
well-personalised system for one marker on one course. He has used this
personalised, and personally created, lashup to increase the quality and speed
of his marking for 3? years now.
- "PDF expert": a not-free pdf Viewer with some additional fns and now
on iPad, iPhone, Mac.
Handwriting, pencil tool for signatures, edit text, "stamps".
- iPhone XR (to support pencil and handwriting recog?)
- Cloud storage of some kind. Esp. university's own for privacy
- Apple sync: recently better at sync-ing his iPhone, iPad, desk Mac.
- Handwriting by apple pen (in Pdf docs).
- Safari multi-window pdf viewers.
Teacher-level functions supported
- Comment bank. ?In word doc/pdf?
"Stamps" in PDF expert: you can just superpose them on any place in any PDF
doc. Essentially a short comment you define. Usually one word, can resize
each stamp to taste. This is essentially another form of comment bank.
[Text editing in PDF documents "like word".]
- Cloud and syncing allows him to do bits of marking anywhere. Instead of
having to carve out large lumps of uninterrupted time, can work on it a bit;
save with a few pointers (e.g. underlining) about where he is in a task, then
resume. Not degrading but allowing less pressure on time by making more bits
of time usable; and much cheaper pause/resume of the task.
- Multiple outputs from the marker: marking up student docs; private word
doc to self with private notes (e.g. to be used in discussion); Create
personalised comments to each student (partly from the comment bank);
- He has a doc that is a form with the rubric and grade descriptors,
duplicates it for each student, and marks up the form to apply to that student
- Each year goes back to the comment bank, and reviews what he might do
better in the course (to fend off common errors by students). The ample cloud
storage makes this easy to afford and to do.
Back to broadening our vision for future APR/ACJ designs
This is not only illuminating but shows that the ACJ user interfaces (UIs)
up to now could with profit be seriously expanded. It's been shown that ACJ
can usefully include the collection and use of formative feedback comments.
However this doesn't tackle the real, full needs of marking in general. When
assessing complex work (portfolios, sculptures, music ...), it is likely that
the assessor will want to make extensive notes to self; and in later rounds of
comparisons, re-view not the original work, but their notes on it.
So: I think there are really several distinct directions in which Pollitt's
work has contributed, yet could be taken further.
A) Algorithms that cut down the work of marking by making economies of scale
B) More reliable (accurate, repeatable) marking.
C) Psych. insight into how assessment really works in the human mind; and hence
how to be more realistic about it, and more aware of what it does well vs.
D) Consequent educational ideas about how to understand and improve
assessment processes with both lower costs and higher benefits.
E) Further educational applications. E.g. if software was good enough and
cheap enough, we could run exercises with students as the markers to:
This is an example of how learning often involves becoming expert,
which in turn involves not just learning concepts and facts,
and how to reason about them;
but also learning new perceptual skills, and new execution skills (fast
sentence composition, ....).
- Train them to make the same judgements as experts, where appropriate and
- Show them how judges do and don't differ from each other in judging the
student's own level.
- Possibly, show students how their judgements are much better (though less
confident) than they think. ....
F) We now know quite a lot about the desirable software requirements that
wasn't known before:
- The modularisation / architecture: e.g. separation of object presentation;
choice of which pair to look at; and the other aspects of the UID (user
- Flexibility in display, and whether to display the work at all.
- Variations in the algorithm; how to deal with worst case ordering of
judgements e.g. a rogue marker; or an unusual object; or one wrong judgement
- The need for multiple outputs from the marker, at least in some cases; and
how to make a much more efficient UI for this.
- How to handle multiple inputs from one learner product as in Plimmer's
A paper on requirements might be a real contribution –
see the next section.
With hindsight, I would now (March 2019) recommend that software for APR be
organised into four well-separated major modules.
- The display part of the user interface.
The "scripts" may be displayed on screen, or not. This module must be ready for
dealing with a display of nothing, or one document (the canonical case),
or many documents. The latter is not an exotic case.
Plimmer (2006) dealt with student submissions of 5 related
documents, just for level 1 computer programming.
Furthermore, they should be displayed in a carefully designed window with
several panes of different custom sizes.
Displaying "nothing" i.e. having an input-only interface would be useful
in going round a sculpture park, assessing the exhibits which are much larger
than the computer.
However also common in such cases is displaying a token reminder e.g. a
thumbnail, to remind the user which case they are judging / comparing.
A common variant of this will be when the user observes each case separately
and makes notes on them; then these notes should be displayed during comparison
phase, to remind the user. So the display module needs to be able to accept
documents from the users (judges) to display here (not just from the authors
or students being assessed). An example could be the sculpture park; but also
(for example) judging which is the best PhD of the year, where reading/
skimming each PhD would be done in an earlier phase, followed by multiple
comparisons and judgements.
- The user input part of the user interface.
Another big insight from Plimmer is that the output of marking is OFTEN
multiple output documents (marks vs comments in many cases; but also notes for
later use in moderation discussions, or private notes on ideas to follow up,
stimulated by a good exam answer).
AND that using a single input device is a major advantage (rather than
switching repeatedly between mouse, keyboard, stylus).
- Statistical calculations engine
1. Calculation of best estimates of ordering and distances.
2. Calculating the orderings of objects, of markers, ....
Follow a ref. to a related literature in psychophysics.
The "staircase" procedure there is also about the "smartest" way to zero in on
small differences between measurements.
Lamming. (papers in HEA subject area ? ....
Kingdom and Prins book "psychophysics"; ch. on "Adaptive methods"
- Decision-making and Optimisation
(about what pairs should be offered for comparison at each point)
This module chooses which pair to present next for judgement, given what is
known at this point in the judging process.
This is the key function for reducing effort per script to be assessed.
This needs to have broader functionality than early software to allow
differing modes e.g. to cope with unknown availability of markers, how to use
a new marker coming on board later than others, etc.
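One simple way such a decision module could work is sketched below. This is a common heuristic, not the algorithm of any particular ACJ package: among pairs that have not yet been judged too often, offer the pair whose current scale estimates are closest, since that is the comparison whose outcome is least predictable. The function name `next_pair`, the `max_repeats` limit and the example values are assumptions made for this sketch.

```python
from itertools import combinations

def next_pair(scale, times_judged, max_repeats=1):
    """Pick the next pair of scripts to offer for comparison.

    scale:        {script_id: current estimated value} from the statistics module
    times_judged: {frozenset({a, b}): number of judgements so far}
    max_repeats:  assumed cap on how often one pair may be offered
    Returns a tuple (a, b), or None when every pair has reached the cap.
    """
    candidates = [
        pair for pair in combinations(sorted(scale), 2)
        if times_judged.get(frozenset(pair), 0) < max_repeats
    ]
    if not candidates:
        return None
    # Closest estimates first: the least predictable comparison.
    return min(candidates, key=lambda pair: abs(scale[pair[0]] - scale[pair[1]]))

# Made-up current estimates; B and C have already been compared once.
scale = {"A": 1.2, "B": 0.1, "C": -0.2, "D": -1.1}
times_judged = {frozenset(("B", "C")): 1}
chosen = next_pair(scale, times_judged)
```

Note that the function only reads the scale estimates and the judgement counts. It needs no knowledge of how the estimates were calculated or of how scripts are displayed, which is the point of keeping the modules separate; coping with markers who arrive late or contribute unevenly would mean extending the inputs, not rewriting the other modules.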
Next jobs or studies to do
Niall Barr's software implementing APR / ACJ
[Should this be moved somewhere else?]
Link to my notes on (using) Niall Barr's software
implementing APR / ACJ
Category, ordinal, interval, ratio-scale
See notes on this topic at this page: