Last changed 6 Oct 2021 ............... Length about 5,000 words (41,000 bytes).
(Document started on 18 Mar 2016.) This is a WWW document maintained by Steve Draper, installed at http://www.psy.gla.ac.uk/~steve/apr/apr.html. You may copy it. How to refer to it.


Assessment by pairwise ranking (a.k.a. "ACJ")

Brief contents list

Unique Selling Points: My list of the 13 distinct features that make APR / ACJ important
Literature References
Talks by us
Names, terminology
Part 2. Beyond Pollitt's work: a new vision
Joe's lashup
Back to broadening our vision for future ACJ designs
Software architecture
Next jobs or studies to do (e.g. using ACJ to test people's skills at various kinds of judgement)
Misc. notes

Assessment by pairwise ranking (APR), also referred to by various other terms e.g. ACJ, has lately emerged from the school sector as a radical departure in assessment, only recently made feasible by technology.

In APR, instead of reading each student script once and deciding its mark, markers see pairs of scripts and decide which should rank above the other on a single complex criterion. Software assembles these pairwise judgements into an ordering (an ordinal scale). However by applying Thurstone's "law of comparative judgement" the software further calculates a quantitative interval scale -- allowing it to calculate what further comparisons will yield the most or least additional information, and so allow optimisations that for large numbers (20 scripts is a very rough threshold) reduce the total marking work. Finally, if used for assessment rather than only for ranking, grade boundaries are superimposed on the rank order.
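To make the scaling step concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the algorithm used by Pollitt's or Niall Barr's software: it fits the Bradley-Terry model (Bradley & Terry 1952, in the further references below), a logistic relative of Thurstone's model, because that is simple to fit from sparse data. The function name `fit_bradley_terry`, the `(winner, loser)` input format and the `prior` smoothing parameter are all hypothetical choices made for this example.

```python
import math
from collections import defaultdict


def fit_bradley_terry(judgements, iterations=200, prior=0.5):
    """Turn (winner, loser) pairs into an interval scale, in logit units.

    Assumption of this sketch: `prior` adds a fractional win and loss
    against an imaginary average script, so that a script which never
    loses (or never wins) still gets a finite value.
    """
    scripts = sorted({s for pair in judgements for s in pair})
    wins = defaultdict(float)   # total wins per script
    met = defaultdict(float)    # number of comparisons per unordered pair
    for winner, loser in judgements:
        wins[winner] += 1
        met[frozenset((winner, loser))] += 1

    strength = {s: 1.0 for s in scripts}
    for _ in range(iterations):
        updated = {}
        for a in scripts:
            # Virtual opponent of fixed strength 1, met 2 * prior times.
            denom = 2 * prior / (strength[a] + 1.0)
            for b in scripts:
                n = met.get(frozenset((a, b)), 0.0)
                if b != a and n:
                    denom += n / (strength[a] + strength[b])
            updated[a] = (wins[a] + prior) / denom
        strength = updated

    # Report log-strengths centred on zero: differences are meaningful,
    # the origin is arbitrary.
    logs = {s: math.log(v) for s, v in strength.items()}
    mean = sum(logs.values()) / len(logs)
    return {s: v - mean for s, v in logs.items()}


if __name__ == "__main__":
    example = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C"), ("B", "C"), ("C", "B")]
    for script, value in sorted(fit_bradley_terry(example).items(), key=lambda kv: -kv[1]):
        print(script, round(value, 2))
```

The point of the sketch is only that the output is more than an ordering: the gaps between values say how decisively one script was preferred to another, which is what lets software decide which further comparisons are worth asking for.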

Controlled studies using professional markers employed by school exam boards have shown that marking in this way gives much higher (NOT lower) reliability, and that for large numbers of scripts, the total time taken is less. The statistics can directly identify when sufficient consensus has been reached; which scripts generate most disagreement (send them for a second and third opinion); and which markers agree least with other markers. Originally designed for higher reliability (repeatability, and so fairness) of marks and reduced costs, it can also collect feedback comments.

The most interesting underlying issue is that APR stands in complete contrast to the assessment approach that is currently dominant by default: breaking a judgement down into separate judgements against multiple explicit criteria, which at least has the virtue of supporting usefully explicit and diagnostic feedback to learners. Instead, APR uses a single complex criterion for marking. However in many ways, real academic values have a large implicit aspect; and furthermore, are holistic instead of being always and simply reductionist. APR is particularly appropriate for portfolios of work, and for work where different students may choose to submit in different media (e.g. printed, web pages, audio tapes).

Further aspects

It should also be noted that APR seems to be, just as Thurstone argued, more natural psychologically. Thus it may be of use to adopt as a method even without any computational support; or when the software simply presents the cases to be compared without optimising how many comparisons are done. Consequently,

Software implementers would probably be well advised if they separated the screen-presentation software module from the statistics and optimisation software.
Educationalists should actively and persistently consider whether APR should be adopted for assessment and evaluation in many contexts, and not just to reduce marking workload or improve reliability on standard assessment types.

Certainly some academics who have read about it, now do paper marking in this way: using pairwise comparisons to sort the whole set into rank order, and then deciding where to put the grade boundaries.
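The paper procedure just described can be sketched as follows. This is only an illustration, assuming a single marker whose judgements are consistent; the names `rank_by_pairwise`, `apply_grade_boundaries`, the `prefer` callback and the boundary format are hypothetical, chosen for this example. Binary insertion is used because it keeps the number of comparisons the marker must make close to the minimum for sorting.

```python
def rank_by_pairwise(scripts, prefer):
    """Return scripts ordered best-first, using only pairwise judgements.

    `prefer(a, b)` stands in for the human marker: it returns True when
    script a is judged better than script b.
    """
    ranked = []
    for script in scripts:
        lo, hi = 0, len(ranked)
        while lo < hi:                       # binary search for the slot
            mid = (lo + hi) // 2
            if prefer(script, ranked[mid]):
                hi = mid
            else:
                lo = mid + 1
        ranked.insert(lo, script)
    return ranked


def apply_grade_boundaries(ranked, boundaries, lowest="C"):
    """Superimpose grades on a rank order.

    `boundaries` lists (grade, end_position) best grade first: positions
    before end_position get that grade; everything else gets `lowest`.
    """
    grades = {}
    for position, script in enumerate(ranked):
        grade = lowest
        for label, end in boundaries:
            if position < end:
                grade = label
                break
        grades[script] = grade
    return grades


if __name__ == "__main__":
    # Hidden "true" qualities stand in for the marker's implicit scale.
    quality = {"s1": 62, "s2": 48, "s3": 75, "s4": 55}
    order = rank_by_pairwise(list(quality), lambda a, b: quality[a] > quality[b])
    print(order)
    print(apply_grade_boundaries(order, [("A", 1), ("B", 3)]))
```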

Gonsalvez et al. (2013), in another context, report a method for supervisors to give summative ratings of competence to students on professional field placements, that uses a standard set of (4) vignettes (for each of 9 domains). The supervisor selects the vignette that is closest to the student's performance on this aspect.
Their argument and evidence is that this leads to more consistency across raters, and less bias of the kind where supervisors seem prone to rate a student overall and then attribute that mark to that student across all domains. This implies that a disadvantage of numerical scores is that they lead psychologically to LESS precision and discrimination than un-numbered vignettes do.

The Thurstone foundation for APR seems also to have a close link to David Nicol's recent studies of what goes on when students perform (reciprocal) peer critiquing. He finds that when a student generates a judgement or critique of another student's work, when they themselves have just performed the same task e.g. their own version of a piece of coursework, then they absolutely cannot prevent themselves doing an inner comparison of their own work with the other student's (a paired comparison); and that they generate a lot of useful thoughts about their own work from that, even when neither asked nor required to do so (Nicol 2018).

Links to Niall's software

https://learn.gla.ac.uk/niall/ltiacj/index.php

Related software *

Huddersfield
Question Mark
Joe's homegrown lash-up
MyCampus: electronic submission
No More Marking

USPs: My list of the 13 distinct features that make APR / ACJ important

As the number of students goes up, this method of marking "scales" i.e. the time and effort required of human markers to do the marking reduces per student.
Working versions of the software have been built, tested, and used; and by more than one person and in more than one organisation.
(Also constructed and used for conference talk refereeing at Glasgow University.)
A major experiment has been done and published, using professional markers; supporting the key claims (Pollitt, 2012). [* Give some numbers]
This paper additionally reports an important qualitative datum: that the markers were highly sceptical (they did the experiment for the money, at standard professional rates for marking) but came to see it as better as well as faster than their traditional way of doing marking.
The method has a compelling psychological naturalness.
(Not surprising since it derives from an old psychological theory.)
It can easily be arranged to collect comments at the same time e.g. to generate formative feedback as well as summative marks.
It can easily mark cross-media (where different students submit in different media).
It can easily mark multi-media (where each student uses several media e.g. pictures and text). E.g. in portfolios.
It can easily be used for/with unusual, subjective, and implicit marking criteria. E.g. quality of musical performance, competence in professional medical practice; or giving a talk to be judged on the extent to which it "sounds like a professional psychologist".
It can be used with one complex criterion OR remarked on each of several separate criteria.
It can be used by matching against vignettes (carefully created standard examples that new work is matched against). [Gonsalvez et al. 2013] Gonsalvez also shows that APR can be used for competence assessment of practical, field activities (e.g. in medicine, veterinary training, ...); and not only for marking to separate students into a ranking of how good each is.
But this is only a novel extension of something likely to be used in many cases: of seeding the new scripts to be assessed, with old scripts selected to stand on grade boundaries; so that the ranking produced by the core APR procedure can be translated into grades.
It can be used with a set of markers, to get them to converge as completely as possible; or to do the overall job as fast as possible when different markers contribute different amounts at different times (as may often be the case with refereeing papers submitted to conferences).
It can be used for judging by teachers OR by peers.
It can be used to see which markers deviate most from the other markers.
It can be used to see which scripts attract the least consistent ratings across markers.
=> I.e. to give a detailed, multi-faceted report on "reliability".
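The last three points can be illustrated with a small sketch. It assumes the judgements are stored as `(judge, winner, loser)` triples and that some scaling step has already produced a value per script; both the data format and the function name `disagreement_report` are hypothetical. It uses a plain count of judgements that go against the consensus ordering, which is cruder than any formal fit statistic a real engine would be likely to use.

```python
from collections import defaultdict


def disagreement_report(judgements, scale):
    """Summarise how often judges and scripts go against the consensus.

    judgements: iterable of (judge, winner, loser)
    scale:      dict mapping script -> consensus value (higher is better)

    A judgement is "against consensus" when the chosen winner has the
    lower scale value. Returns (by_judge, by_script), each mapping a
    name to the proportion of its judgements that went against.
    """
    judge_counts = defaultdict(lambda: [0, 0])    # [against, total]
    script_counts = defaultdict(lambda: [0, 0])
    for judge, winner, loser in judgements:
        against = 1 if scale[winner] < scale[loser] else 0
        for counts in (judge_counts[judge], script_counts[winner], script_counts[loser]):
            counts[0] += against
            counts[1] += 1
    by_judge = {j: a / n for j, (a, n) in judge_counts.items()}
    by_script = {s: a / n for s, (a, n) in script_counts.items()}
    return by_judge, by_script


if __name__ == "__main__":
    scale = {"A": 1.0, "B": 0.0, "C": -1.0}
    judgements = [
        ("m1", "A", "B"), ("m1", "B", "C"),
        ("m2", "A", "C"), ("m2", "C", "B"),
        ("m3", "B", "A"), ("m3", "C", "A"),
    ]
    by_judge, by_script = disagreement_report(judgements, scale)
    print("Judge who deviates most:", max(by_judge, key=by_judge.get))
    print("Per-script disagreement:", by_script)
```

Scripts with a high proportion are the natural candidates for a second and third opinion; judges with a high proportion are the ones to look at when asking who agrees least with the others.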

Separate intellectual ideas

Thurstone: a better model of the psychological process of judgement
Nicol: involuntary comparative judgements. (Nicol, 2018)
(All measurement is relative in the end.)
Plimmer: Real task analysis of the detail of what human markers are doing when marking.

References

Bramley & Oates (2010)
Dale, V.H.M & Singer, J. (2019) "Learner experiences of a blended course incorporating a MOOC on Haskell functional programming" Research in Learning Technology vol.27. doi:10.25304/rlt.v27.2248
Gonsalvez, C. J., Bushnell, J., Blackman, R., Deane, F., Bliokas, V., Nicholson-Perry, K., . . . Knight, R. (2013) "Assessment of Psychology Competencies in Field Placements: Standardized Vignettes Reduce Rater Bias" Training and Education in Professional Psychology vol.7 no.2 pp.99-111 doi:10.1037/a0031617
Nicol, David (2018) "Unlocking generative feedback through peer reviewing" ch.5 p.47-59 in V.Grion & A.Serbati (2018) Assessment of learning or assessment for learning? Towards a culture of sustainable assessment in higher education https://www.reap.ac.uk/Portals/101/Documents/PEER/Research/NICOL_Unlocking_published_English.pdf
Plimmer,Beryl & Apperley,M.D. (2007) "Making paperless work" CHINZ '07 Proceedings of the 7th ACM SIGCHI New Zealand chapter's international conference on Computer-human interaction: design centered HCI 2007: pp.1-8 dl.acm.org/citation.cfm?id=1278961 doi:10.1145/1278960.1278961
Plimmer,B. & Mason,P. (2006) "A pen-based paperless environment for annotating and marking student assignments" PROC.7TH AUSTRALASIAN USER INTERFACE CONFERENCE, CRPIT PRESS pp.37-44 http://crpit.com/confpapers/CRPITV50Plimmer.pdf
Pollitt,A. (2004) "Let's stop marking exams" presented at IAEA conference. Also available at:
http://www.cambridgeassessment.org.uk/Images/109719-let-s-stop-marking-exams.pdf
Pollitt,A. (2012) "The method of Adaptive Comparative Judgement" Assessment in Education: Principles, Policy & Practice Vol.19 no.3 pp.281-300 doi:10.1080/0969594X.2012.665354
Correction to an equation (published vol.19, no.3, p.387 doi: 10.1080/0969594X.2012.694697)
Thurstone, L.L. 1927a. "A law of comparative judgment" Psychological Review vol.34 no.4 pp.273-286 [No doi, but online from GU library]
Reprinted in L.L.Thurstone (1959) The measurement of values Chapter 3 (Chicago, IL: University of Chicago Press)
[This is a general and technical statement of "A law of comparative judgment"; with some maths.]
Thurstone, L.L. 1927b. "Psychophysical analysis" The American Journal of Psychology vol.38 no.3 pp.368-89 https://www.jstor.org/stable/1415006 [This discusses at some length its application to perceptual judgements.]
Thurstone, L.L. 1931. "Measurement of change in social attitude" Journal of Social Psychology vol.2 no.2 pp.230-5 doi:10.1080/00224545.1931.9918969 [This briefly describes its use in measuring social attitudes.]

More references (mostly received from Paul Anderson)

Ajjawi, R., & Bearman, M. (2018). "Problematising standards: representation or performance?" ch.4 pp.57-66 in David Boud, Rola Ajjawi, Phillip Dawson, Joanna Tai (eds.) Developing Evaluative Judgement in Higher Education Routledge.
Barrada, J. R., Olea, J., Ponsoda, V., & Abad, F. J. (2010) "A method for the comparison of item selection rules in computerized adaptive testing" Applied Psychological Measurement 34(6), 438-452. doi:10.1177/0146621610370152
Bloxham, S., den-Outer, B., Hudson, J., & Price, M. (2016) "Let's stop the pretence of consistent marking: exploring the multiple limitations of assessment criteria" Assessment & Evaluation in Higher Education, 41(3), 466-481. doi:10.1080/02602938.2015.1024607
Bradley, R. A., & Terry, M. E. (1952). "Rank analysis of incomplete block designs: I. The method of paired comparisons" Biometrika 39(3/4), 324-345. doi:10.2307/2334029
Bramley, T., & Wheadon, C. (2015, November). "The reliability of Adaptive Comparative Judgement" In Paper presented at the AEA-Europe annual conference (Vol. 4, p. 7). cambridgeassessment.org.uk/Images/296241-the-reliability-of-adaptive-comparative-judgment.pdf
cambridgeassessment.org.uk/Images/232694-investigating-the-reliability-of-adaptive-comparative-judgment.pdf
Brinker, C., Mencía, E. L., & Fürnkranz, J. (2014, December) "Graded multilabel classification by pairwise comparisons" In 2014 IEEE International Conference on Data Mining pp. 731-736 IEEE. doi:10.1109/ICDM.2014.102
Buse, R. P., & Weimer, W. R. (2008, July) "A metric for software readability" In Proceedings of the 2008 international symposium on Software testing and analysis (pp. 121-130). ACM. cs.otago.ac.nz/cosc345/resources/Read-Ex-3.pdf
Glass,R.L. (2003) "About Education" ch.7 p.181-4 in Facts and Fallacies of Software Engineering Addison-Wesley, Boston, MA.
Hardy, J., Galloway, R., Rhind, S., McBride, K., Hughes, K., & Donnelly, R. "Ask, answer, assess" Higher Education Academy.
Kimbell, R. (2008). "E-assessment in project e-scape" Design and Technology Education: An International Journal, 12(2). jil.lboro.ac.uk/ojs/index.php/DATE/article/download/Journal_12.2_0707_RES6/59
McKenzie, Ross (2018) "Progressive Adaptive Comparative Judgement" Unpublished.
McKenzie, Ross "A python implementation of Progressive Adaptive Comparative Judgement" https://github.com/RossMcKenzie/ACJ
Negahban, S., Oh, S., & Shah, D. (2012). "Iterative ranking from pair-wise comparisons" In Advances in neural information processing systems (pp. 2474-2482). papers.nips.cc/paper/4701-iterative-ranking-from
Relf, P. A. (2004). Achieving software quality through source code readability Quality Contract Manufacturing LLC. citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.8894&rep=rep1&type=pdf
Wauthier, F., Jordan, M., & Jojic, N. (2013, February). "Efficient ranking from pairwise comparisons" In International Conference on Machine Learning (pp.109-117). proceedings.mlr.press/v28/wauthier13.pdf

Talks

"Assessment by pairwise ranking" Talk proposed but rejected for GU conference 2013
"From a thousand learners to a thousand markers: Scaling peer feedback with Adaptive Comparative Judgement" A talk at the 2019 GU teaching and learning conference. Slides
"Scaling Assessment with Adaptive Comparative Judgement" A talk at the SRHE Digital University seminar in Edinburgh Online Assessment: design, scale, and creativity Friday, 14 June 2019 [alternative event record]
Slides Note this is a 7MB pdf file. Feel free to redistribute as required.
"From a thousand learners to a thousand markers: Scaling peer feedback with Adaptive Comparative Judgement" Talk at ALT-C conference 2019, Edinburgh

Names, terminology

Adaptive Comparative Judgement (ACJ) is the term used by Pollitt in the most important publication so far. However it describes the software, not the human process nor the psychology of it. The software does the adaptation; the human judges do not. The humans just do pairwise comparisons or (more exactly) ranking.

The theory, following Thurstone, assumes that there is an implicit, psychologically real, scale in the minds of the humans, which is not directly accessible by them (through introspection or reasoning), but reveals itself as a consistent pattern in their judgements.
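For reference, the law can be written in its usual textbook form. The symbols are conventional ones, not taken from this page: S_i is the scale value of object i, sigma_i its discriminal dispersion, r_ij the correlation between the two discriminal processes, p_ij the proportion of judgements ranking i above j, and Phi the standard normal distribution function.

```latex
% Thurstone's law of comparative judgement (general form)
S_i - S_j = z_{ij}\,\sqrt{\sigma_i^{2} + \sigma_j^{2} - 2\,r_{ij}\,\sigma_i\,\sigma_j},
\qquad z_{ij} = \Phi^{-1}(p_{ij})

% Case V: equal dispersions and zero correlation
S_i - S_j = z_{ij}\,\sigma\sqrt{2}
```

In Case V the hidden scale is thus recovered, up to a choice of unit and origin, from nothing more than the observed proportions of pairwise preferences.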

Furthermore: that this is true of complex judgements, and that these are not helped by attempting to break them down into multiple component criteria which must then be combined again to reach a mark; almost always, by a simplistic arithmetic operation such as adding marks, which generally does not reproduce the academic value judgements actually made by the experts consulted.

TR = Thurstone ranking [Grove-Stephenson]
TP = Thurstone pairs [Mhairi]
TS = Thurstone scaling [a web page on psych methods]
LCJ = Law of comparative judgement (Thurstone's "law")
CJ = Comparative Judgement; "Thurstone's method of CJ". [Pollitt]
TPCJ = Thurstone Paired Comparative Judgements; & "Thurstone's method of CJ".
DCJ = Direct Comparative Judgement [Pollitt]
PC = Pairwise (or Paired) Comparison [Bramley]
ROM = Rank Ordering Method [Bramley]
PCM = Pairwise (or Paired) Comparison Methods [Bramley]
*APR = Assessment by Pairwise Ranking
ADCJ = Assessment by Direct Comparative Judgement [Pollitt]
PRS = Pairwise Ranking Scales
PRTS = Pairwise Ranking Thurstone Scales
PCR = Pairwise Comparative Ranking [my term, but is best, avoids abbrev. PC]
PCRS = Pairwise Comparative Ranking Scales [my term]
ACJ = Adaptive Comparative Judgement [common]

Currently preferred terms:
APR for general overall process; or
ACJ for the software-optimised cost-saving version. CJs for individual judgements.

N.B.  comparative & judgement are redundant [not quite because judgement can be absolute]
comparative & ranking are redundant [TRUE]
Strictly, Thurstone scaling produces more than a ranking: an interval scale.

Part 2: Beyond Pollitt's work

A vision. Or a wish, anyway.

I'd like to see a UI (user interface) allowing use of the APR process without displaying the objects on screen (e.g. handwritten exam scripts): just have a unique ID on a post-it on each script, and the software selecting pairs to compare, receiving the judgements, receiving the formative comments if any. While many things can be usefully digitised and presented on a computer, this will probably never be true of all the objects we may need to judge.

Here's another thought I'd like to share. Pollitt's work made a deep impression on me, in making me think about the real psychological marking process; and how better to work with it, and support it. But I learned another big lesson, which might perhaps be combined with APR, from Beryl Plimmer (ref.s above):

Plimmer worked in HCI (Human Computer Interaction) and did a real, proper task analysis of what is involved in marking a heap of first year programming assignments. Firstly: the student submissions were zip files, with many parts (code, output of the test suite, documentation ....). This is another case where current user interfaces (UIs) to ACJ engines just won't cut it. Plimmer's software opened all the zipped files at once in separate windows (and probably had careful window/pane management).

Secondly, she recognised that the marker had to input multiple DIFFERENT responses, in DIFFERENT documents (and formats), and for DIFFERENT audiences: the mark to admin people; feedback to the student; markups on the code itself, ..... And she used a single pen input device (and then character recognition to turn it into digital stuff) to save the user switching from mouse to keyboard constantly.

Thirdly, this made me reflect on my own marking (of piles of essays in psychology) and why I never felt software would help because I use a big table (enough for at least four but preferably six people to sit at) exactly because there is so much stuff that needs to be visible/writeable at once, and computer screens are so pathetically small compared to a table. (But Plimmer shows that you CAN use a screen but that special code for opening and managing the windows automatically makes a considerable difference.)

In fact, in my own case I generally have four "outputs" on different bits of paper:

Comments to return to the student
My private comments to myself e.g. to use when discussing my marks with a 2nd marker.
Mark sheet to the Admin staff
A sheet with any ideas and refs the student used that I want to follow up for myself, because sometimes students have things to teach me, and/or trigger off new thoughts in me that, as a subject specialist, I don't want to forget. Exams, especially in final year options, often cause learning in the teacher.

Joe's lashup

Related to Plimmer's insights into improving software to improve the marking task is how some individuals have created substantial software improvements.

Plimmer's paper gives a published description of her carefully designed system for marking year 1 programming assignments in her dept. Joe Maguire's practice concerns a different kind of course and marking, and is implemented quite differently: it is built out of ready-to-hand software, adapted into a well-personalised system for one marker on one course. He has used this personalised, and personally created, lashup to increase the quality and speed of his marking for 3? years now.

Software

"PDF expert": a not-free pdf Viewer with some additional fns and now on iPad, iPhone, Mac.
Handwriting, pencil tool for signatures, edit text, "stamps".
iPhone XR (to support pencil and handwriting recog?)
Cloud storage of some kind. Esp. university's own for privacy
Apple sync: recently better at sync-ing his iPhone, iPad, desk Mac.
Handwriting by apple pen (in Pdf docs).
Safari multi-window pdf viewers.

Teacher-level functions supported

Comment bank. ?In word doc/pdf? "Stamps" in PDF expert: you can just superpose them on any place in any PDF doc. Essentially a short comment you define. Usually one word, can resize each stamp to taste. This is essentially another form of comment bank. [Text editing in PDF documents "like word".]
Cloud and syncing allows him to do bits of marking anywhere. Instead of having to carve out large lumps of uninterrupted time, can work on it a bit; save with a few pointers (e.g. underlining) about where he is in a task, then resume. Not degrading but allowing less pressure on time by making more bits of time usable; and much cheaper pause/resume of the task.
Multiple outputs from the marker: marking up student docs; private word doc to self with private notes (e.g. to be used in discussion); Create personalised comments to each student (partly from the comment bank);
He has a doc that is a form with the rubric and grade descriptors, duplicates it for each student, and marks up the form to apply to that student per criterion.
Each year goes back to the comment bank, and reviews what he might do better in the course (to fend off common errors by students). The ample cloud storage makes this easy to afford and to do.

Back to broadening our vision for future APR/ACJ designs

This is not only illuminating but shows that the ACJ user interfaces (UIs) up to now could with profit be seriously expanded. It's been shown that ACJ can usefully include the collection and use of formative feedback comments. However this doesn't tackle the real, full needs of marking in general. When assessing complex work (portfolios, sculptures, music ...), it is likely that the assessor will want to make extensive notes to self; and in later rounds of comparisons, re-view not the original work, but their notes on it.

So: I think there are really several distinct directions in which Pollitt's work has contributed, yet could be taken further.

A) Algorithms that cut down the work of marking by making economies of scale in comparisons.

B) More reliable (accurate, repeatable) marking.

C) Psych. insight into how assessment really works in the human mind; and hence how to be more realistic about it, and more aware of what it does well vs. poorly.

D) Consequent educational ideas about how to understand and improve assessment processes with both lower costs and higher benefits.

E) Further educational applications. E.g. if software was good enough and cheap enough, we could run exercises with students as the markers to:

Train them to make the same judgements as experts, where appropriate and important.
Show them how judges do and don't differ from each other in judging the student's own level.
Possibly, show students how their judgements are much better (though less confident) than they think. ....

This is an example of how learning often involves becoming expert, which in turn involves not just learning concepts and facts, and how to reason about them; but also learning new perceptual skills, and new execution skills (fast sentence composition, ....).

F) We now know quite a lot about the desirable software requirements that wasn't known before:

The modularisation / architecture: e.g. separation of object presentation; choice of which pair to look at; and the other aspects of the UID (user interface design).
Flexibility in display, and whether to display the work at all.
Variations in the algorithm; how to deal with worst case ordering of judgements e.g. a rogue marker; or an unusual object; or one wrong judgement early on.
The need for multiple outputs from the marker, at least in some cases; and how to make a much more efficient UI for this.
How to handle multiple inputs from one learner product as in Plimmer's zip files.

A paper on requirements might be a real contribution – see the next section.

Software architecture

With hindsight, I would now (March 2019) recommend that software for APR be organised into four well-separated major modules.

The display part of the user interface.
The "scripts" may be displayed on screen, or not. This module must be ready for dealing with a display of nothing, or one document (the canonical case), or many documents. The latter is not an exotic case. Plimmer (2006) dealt with student submissions of 5 related documents, just for level 1 computer programming. Furthermore, they should be displayed in a carefully designed window with several panes of different custom sizes.
Displaying "nothing" i.e. having an input-only interface would be useful in going round a sculpture park, assessing the exhibits which are much larger than the computer. However also common in such cases is displaying a token reminder e.g. a thumbnail, to remind the user which case they are judging / comparing. A common variant of this will be when the user observes each case separately and makes notes on them; then these notes should be displayed during comparison phase, to remind the user. So the display module needs to be able to accept documents from the users (judges) to display here (not just from the authors or students being assessed). An example could be the sculpture park; but also (for example) judging which is the best PhD of the year, where reading/skimming each PhD would be done in an earlier phase, followed by multiple comparisons and judgements.
The user input part of the user interface.
Another big insight from Plimmer is that the output of marking is OFTEN multiple output documents (marks vs comments in many cases; but also notes for later use in moderation discussions, or private notes on ideas to follow up, stimulated by a good exam answer). AND that using a single input device is a major advantage (rather than switching repeatedly between mouse, keyboard, stylus).
Statistical calculations engine
1. Calculation of best estimates of ordering and distances. 2. Calculating the orderings of objects, of markers, ....

* Follow a ref. to a related literature in psychophysics. The "staircase" procedure there is also about the "smartest" way to zero in on small differences between measurements. Lamming. (papers in HEA subject area ? .... Kingdom and Prins book "psychophysics"; ch. on "Adaptive methods" GU library: http://encore.lib.gla.ac.uk/iii/encore/record/C__Rb2749176

Decision-making and Optimisation of the combined human-software job
(about what pairs should be offered for comparison at each point)
This module chooses which pair to present next for judgement, given what is known at this point in the judging process. This is the key function for reducing effort per script to be assessed. This needs to have broader functionality than early software to allow differing modes e.g. to cope with unknown availability of markers, how to use a new marker coming on board later than others, etc.
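The four-module separation might be sketched as below. Everything here is hypothetical and for illustration only: the interface names (`Display`, `JudgeInput`, `StatsEngine`, `PairChooser`), the stand-in classes, and in particular the "closest estimates first" selection rule and the net-wins statistic, which are simple placeholders and not the adaptive method of any existing ACJ engine.

```python
from __future__ import annotations

from itertools import combinations
from typing import Protocol


class Display(Protocol):
    def show(self, a: str, b: str) -> None: ...

class JudgeInput(Protocol):
    def get_choice(self, a: str, b: str) -> str: ...

class StatsEngine(Protocol):
    def record(self, winner: str, loser: str) -> None: ...
    def estimates(self) -> dict[str, float]: ...

class PairChooser(Protocol):
    def next_pair(self, estimates: dict[str, float],
                  seen: set[frozenset[str]]) -> tuple[str, str] | None: ...


class NullDisplay:
    """Input-only mode: the objects are physical, so nothing is shown."""
    def show(self, a, b):
        pass

class OracleJudge:
    """Stand-in for a human judge, driven by a hidden quality table."""
    def __init__(self, quality):
        self.quality = quality
    def get_choice(self, a, b):
        return a if self.quality[a] >= self.quality[b] else b

class WinCountEngine:
    """Placeholder statistics: net wins per script, not a Thurstone scale."""
    def __init__(self, scripts):
        self.score = {s: 0.0 for s in scripts}
    def record(self, winner, loser):
        self.score[winner] += 1
        self.score[loser] -= 1
    def estimates(self):
        return dict(self.score)

class ClosestPairChooser:
    """Offer the not-yet-judged pair whose current estimates are closest."""
    def next_pair(self, estimates, seen):
        unseen = [p for p in combinations(sorted(estimates), 2)
                  if frozenset(p) not in seen]
        if not unseen:
            return None
        return min(unseen, key=lambda p: abs(estimates[p[0]] - estimates[p[1]]))


def run_session(display: Display, judge: JudgeInput, engine: StatsEngine,
                chooser: PairChooser, max_judgements: int) -> dict[str, float]:
    """Glue loop: each module can be replaced without touching the others."""
    seen: set[frozenset[str]] = set()
    for _ in range(max_judgements):
        pair = chooser.next_pair(engine.estimates(), seen)
        if pair is None:
            break
        a, b = pair
        display.show(a, b)
        winner = judge.get_choice(a, b)
        engine.record(winner, b if winner == a else a)
        seen.add(frozenset(pair))
    return engine.estimates()


if __name__ == "__main__":
    quality = {"s1": 62, "s2": 48, "s3": 75, "s4": 55}
    print(run_session(NullDisplay(), OracleJudge(quality),
                      WinCountEngine(quality), ClosestPairChooser(), 6))
```

The design choice being illustrated is only that the glue loop knows nothing about how scripts are displayed, how judgements are entered, how the scale is computed, or how the next pair is picked; swapping `NullDisplay` for a multi-pane viewer, or `ClosestPairChooser` for a rule that copes with markers arriving late, would not touch the other modules.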

Next jobs or studies to do

A different kind of study would be to study people hand-ranking small-ish samples of objects. Could study differences between people; study it from the viewpoint of implicit concepts and judgements. And detect who has developed good judgement or judgement of a certain kind. As an assessment of their (implicit) judgement skill. The ACJ software can comment on differences between judges, and so could be part of a system to judge their judgements.

* ...