Last changed 14 April 2019 ............... Length about 4,000 words (32,000 bytes).
(Document started on 18 Mar 2016.) This is a WWW document maintained by Steve Draper, installed at http://www.psy.gla.ac.uk/~steve/apr/apr.html. You may copy it. How to refer to it.

Web site logical path: [www.psy.gla.ac.uk] [~steve] [apr] [this page] [popup instructions] [ex1] [ex2] [Niall Barr's software] [Talk abstract]

Assessment by pairwise ranking (a.k.a. "ACJ")

Assessment by pairwise ranking (APR), also referred to by various other terms, e.g. ACJ, has recently emerged from the school sector as a radical departure in assessment, made feasible only now by technology.

In APR, instead of reading each student script once and deciding its mark, markers see pairs of scripts and decide which should rank above the other on a single complex criterion. Software assembles these pairwise judgements into a quantitative interval scale (based on Thurstone's "law of comparative judgement"). Finally, if used for assessment rather than only for ranking, grade boundaries are superimposed on the rank order.
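For readers who want to see the mechanics, here is a minimal sketch (in Python; this is not Niall Barr's software or any existing ACJ engine) of one standard way of turning pairwise win counts into Thurstone Case V scale values. The scripts, counts and names are invented purely for illustration.

```python
# A minimal sketch of Thurstone Case V scaling: each script's value is the
# mean z-score of the proportion of comparisons it wins against every other
# script. Scripts and counts below are invented for illustration.
from statistics import NormalDist

# wins[(a, b)] = number of judges who placed script a above script b
wins = {
    ("A", "B"): 8, ("B", "A"): 2,
    ("A", "C"): 9, ("C", "A"): 1,
    ("B", "C"): 6, ("C", "B"): 4,
}
scripts = sorted({s for pair in wins for s in pair})
inv_cdf = NormalDist().inv_cdf   # inverse of the standard normal CDF

def thurstone_scale(wins, scripts):
    scale = {}
    for i in scripts:
        zs = []
        for j in scripts:
            if i == j:
                continue
            w, l = wins.get((i, j), 0), wins.get((j, i), 0)
            if w + l == 0:
                continue                    # this pair was never compared
            p = (w + 0.5) / (w + l + 1.0)   # smoothed win proportion, avoids p = 0 or 1
            zs.append(inv_cdf(p))
        scale[i] = sum(zs) / len(zs)
    return scale

print(thurstone_scale(wins, scripts))  # A gets the highest value, C the lowest
```

The point of the arithmetic is that the output is an interval scale, not just an order: the gaps between scripts are meaningful, which is what later allows grade boundaries to be placed on it.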

Controlled studies using professional markers employed by school exam boards have shown that marking in this way gives much higher (NOT lower) reliability, and that for large numbers of scripts, the total time taken is less. The statistics can directly identify when sufficient consensus has been reached; which scripts generate most disagreement (send them for a second and third opinion); and which markers agree least with other markers. Originally designed for higher reliability (repeatability, and so fairness) of marks and reduced costs, it can also collect feedback comments.
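Continuing the toy example above, one crude way of seeing how "which markers agree least" could be flagged is to count how often each judge's decisions contradict the fitted scale. Real ACJ engines use proper misfit statistics; this sketch, with invented judges and values, only shows the idea.

```python
# A rough sketch: the fraction of each judge's decisions that go against the
# fitted scale values. Judges, judgements and scale values are invented.
def judge_misfit(judgements, scale):
    """judgements: list of (judge, winner, loser) tuples."""
    contradictions, totals = {}, {}
    for judge, winner, loser in judgements:
        totals[judge] = totals.get(judge, 0) + 1
        if scale[winner] < scale[loser]:   # this decision contradicts the scale
            contradictions[judge] = contradictions.get(judge, 0) + 1
    return {j: contradictions.get(j, 0) / totals[j] for j in totals}

judgements = [("J1", "A", "B"), ("J1", "A", "C"), ("J2", "B", "A"), ("J2", "B", "C")]
scale = {"A": 0.9, "B": -0.3, "C": -0.6}   # values like those from the fit above
print(judge_misfit(judgements, scale))     # J2 contradicts the consensus more often
```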

The most interesting underlying issue is that APR is in complete contrast to the currently dominant, default assessment approach of breaking a judgement down into separate judgements against multiple explicit criteria, an approach which at least has the virtue of supporting usefully explicit and diagnostic feedback to learners. Instead, APR uses a single complex criterion for marking. However, in many ways real academic values have a large implicit aspect, and furthermore are holistic rather than always and simply reductionist. APR is particularly appropriate for portfolios of work, and for work where different students may choose to submit in different media (e.g. printed, web pages, audio tapes).

Further aspects

It should also be noted that APR seems to be, just as Thurstone argued, psychologically more natural. Thus it may be useful to adopt as a method even without any computational support, or when the software simply presents the cases to be compared without optimising how many comparisons are done.

Certainly some academics who have read about it now do paper marking in this way: using pairwise comparisons to sort the whole set into rank order, and then deciding where to put the grade boundaries.
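As a sketch of what that paper process amounts to, here is a comparison sort driven by a marker's answers; the script IDs are invented, and in practice they would just be labels on the physical scripts.

```python
# A minimal sketch of the paper-and-pencil version: sort a pile of scripts
# into rank order by repeatedly asking the marker which of two is better.
from functools import cmp_to_key

def ask_marker(a, b):
    """Returns 1 if the marker says a is better, -1 if b is."""
    answer = input(f"Which is better, {a} or {b}? ").strip()
    return 1 if answer == a else -1

scripts = ["S01", "S02", "S03", "S04"]
ranked = sorted(scripts, key=cmp_to_key(ask_marker), reverse=True)
print("Rank order, best first:", ranked)
# Grade boundaries are then placed by eye on this ordered list.
```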

Gonsalvez et al. (2013), in another context, report a method for supervisors to give summative ratings of competence to students on professional field placements, which uses a standard set of four vignettes for each of nine domains. The supervisor selects the vignette that is closest to the student's performance on that aspect.
Their argument and evidence are that this leads to more consistency across raters, and less bias of the kind where supervisors seem prone to rate a student overall and then attribute that mark to the student across all domains. The implication is that numerical scores have the disadvantage of leading, psychologically, to LESS precision and discrimination than un-numbered vignettes do.

The Thurstone foundation for APR also seems to have a close link to David Nicol's recent studies of what goes on when students perform (reciprocal) peer critiquing. He finds that when a student generates a judgement or critique of another student's work, having just performed the same task themselves (e.g. their own version of a piece of coursework), they absolutely cannot prevent themselves from making an inner comparison of their own work with the other student's (a paired comparison); and that they generate a lot of useful thoughts about their own work from that, even when neither asked nor required to do so.

Links to Niall's software

https://learn.gla.ac.uk/niall/ltiacj/index.php

Related software

USPs: my list of the distinct features that make APR / ACJ important

Separate intellectual ideas

References

More references (received from Paul Anderson)

Names, terminology

Adaptive Comparative Judgement (ACJ) is the term used by Pollitt in the most important publication so far. However, it describes the software, not the human process or the psychology of it. The software does the adaptation, the human judges do not: the humans just do pairwise comparisons or (more exactly) ranking.
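To make that division of labour concrete, here is a sketch of the kind of adaptive rule the software might apply when choosing the next pair: prefer pairs that are close together on the current scale and have been compared least. This illustrates the general idea only; it is not Pollitt's actual algorithm, and all names are invented.

```python
# A sketch of the 'adaptive' part done by the software, not the judge:
# pick the pair of scripts whose current scale estimates are closest,
# preferring pairs that have been compared fewest times.
from itertools import combinations

def next_pair(scale, times_compared):
    """scale: current value per script; times_compared: count per unordered pair."""
    def priority(pair):
        a, b = pair
        return (times_compared.get(frozenset(pair), 0), abs(scale[a] - scale[b]))
    return min(combinations(sorted(scale), 2), key=priority)

scale = {"A": 0.9, "B": 0.1, "C": 0.0, "D": -1.2}
times_compared = {frozenset({"B", "C"}): 3}
print(next_pair(scale, times_compared))  # ('A', 'B'): closest of the pairs not yet compared
```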

The theory, following Thurstone, assumes that there is an implicit, psychologically real, scale in the minds of the humans, which is not directly accessible by them (through introspection or reasoning), but reveals itself as a consistent pattern in their judgements.

Furthermore, the theory assumes that this is true of complex judgements, and that these are not helped by attempting to break them down into multiple component criteria which must then be combined again to reach a mark; almost always by a simplistic arithmetic operation such as adding marks, which generally does not reproduce the academic value judgements actually made by the experts consulted.
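For readers who want the formula behind this, Thurstone's law of comparative judgement in its commonly used Case V form can be written as follows (the notation here is mine, not Pollitt's):

```latex
% Thurstone's law of comparative judgement, Case V form.
% v_A and v_B are the latent scale values of scripts A and B;
% \Phi is the standard normal cumulative distribution function.
P(\text{A judged better than B}) = \Phi\!\left(\frac{v_A - v_B}{\sqrt{2}}\right)
% Inverting this relation on observed win proportions recovers the differences
% v_A - v_B, which is why the fitted values form an interval scale rather than
% a mere rank order.
```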

TR = Thurstone ranking [Grove-Stephenson]
TP = Thurstone pairs [Mhairi]
TS = Thurstone scaling [a web page on psych methods]
LCJ = Law of comparative judgement (Thurstone's "law")
CJ = Comparative Judgement; "Thurstone's method of CJ". [Pollitt]
TPCJ = Thurstone Paired Comparative Judgements; & "Thurstone's method of CJ".
DCJ = Direct Comparative Judgement [Pollitt]
PC = Pairwise (or Paired) Comparison [Bramley]
ROM = Rank Ordering Method [Bramley]
PCM = Pairwise (or Paired) Comparison Methods [Bramley]
*APR = Assessment by Pairwise Ranking
ADCJ = Assessment by Direct Comparative Judgement [Pollitt]
PRS = Pairwise Ranking Scales
PRTS = Pairwise Ranking Thurstone Scales
PCR = Pairwise Comparative Ranking [my term, but is best, avoids abbrev. PC]
PCRS = Pairwise Comparative Ranking Scales [my term]
ACJ = Adaptive Comparative Judgement [common]

Currently preferred terms:
APR for the general overall process; or
ACJ for the software-optimised cost-saving version. CJs for individual judgements.

N.B. "comparative" and "judgement" are redundant [not quite, because judgement can be absolute];
"comparative" and "ranking" are redundant [TRUE].
Strictly, Thurstone scaling produces more than a ranking: an interval scale.

Part 2

A vision. Or a wish, anyway.

I'd like to see a UI (user interface) allowing use of the APR process without displaying the objects on screen (e.g. handwritten exam scripts): just have a unique ID on a post-it note on each script, with the software selecting pairs to compare, receiving the judgements, and receiving the formative comments if any. While many things can be usefully digitised and presented on a computer, this will probably never be true of all the objects we may need to judge.
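As a rough indication of how little such an interface would need, here is a hypothetical sketch of a screen-free session: the software names two post-it IDs, and the marker types in the winner and an optional comment. Every function and variable name here is invented.

```python
# A hypothetical sketch of the screen-free workflow imagined above: the
# software only names two post-it IDs; the objects themselves stay on paper.
def run_session(scripts, rounds, choose_pair, record):
    for _ in range(rounds):
        a, b = choose_pair(scripts)
        winner = input(f"Compare scripts {a} and {b}. Which is better? ").strip()
        comment = input("Optional comment (return to skip): ").strip()
        record(a, b, winner, comment or None)

judgements = []
run_session(
    scripts=["S01", "S02", "S03"],
    rounds=3,
    choose_pair=lambda s: (s[0], s[1]),   # stand-in for an adaptive pair selector
    record=lambda a, b, w, c: judgements.append((a, b, w, c)),
)
print(judgements)
```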

Here's another thought I'd like to share. Pollitt's work made a deep impression on me, in making me think about the real psychological marking process, and how better to work with it and support it. But I learned another big lesson, which might perhaps be combined with APR, from Beryl Plimmer (refs above):

Plimmer worked in HCI (Human-Computer Interaction) and did a real, proper task analysis of what is involved in marking a heap of first-year programming assignments. Firstly, the student submissions were zip files with many parts (code, output of the test suite, documentation, ...). This is another case where current UIs to ACJ engines just won't cut it. Plimmer's software opened all the zipped files at once in separate windows (and probably had careful window/pane management).

Secondly, she recognised that the marker had to produce multiple DIFFERENT outputs in response, in DIFFERENT documents (and formats), and for DIFFERENT audiences: the mark for the admin people; feedback for the student; markups on the code itself; and so on. And she used a single pen input device (with character recognition to turn the writing into digital form) to save the user from constantly switching between mouse and keyboard.

Thirdly, this made me reflect on my own marking (of piles of essays in psychology) and why I never felt software would help: I use a big table (enough for at least 4, but preferably 6, people to sit at) precisely because there is so much material that needs to be visible and writeable at once, and computer screens are so pathetically small compared to a table. (But Plimmer shows that you CAN use a screen, and that special code for opening and managing the windows automatically makes a considerable difference.)

In fact, in my own case I generally have 4 "outputs" on different bits of paper:

  1. Comments to return to the student
  2. My private comments to myself e.g. to use when discussing my marks with a 2nd marker.
  3. Mark sheet to the Admin staff
  4. A sheet with any ideas and refs the student used that I want to follow up for myself, because sometimes students have things to teach me, and/or trigger off new thoughts in me that, as a subject specialist, I don't want to forget. Exams, especially in final year options, often cause learning in the teacher.

This is not only illuminating, but shows that the ACJ UIs up to now could with profit be seriously expanded. It has been shown that ACJ can usefully include the collection and use of formative feedback comments, but it does not yet tackle the real, full needs of marking in general. When assessing complex work (portfolios, sculptures, music, ...), it is likely that the assessor will want to make extensive notes to self, and in later rounds of comparisons to re-view not the original work but their notes on it.

So: I think there are really several distinct directions in which Pollitt's work has contributed, and could be taken further.

A) Algorithms that cut down the work of marking by making economies of scale in comparisons.

B) More reliable (accurate, repeatable) marking.

C) Psychological insight into how assessment really works in the human mind; and hence how to be more realistic about it, and more aware of what it does well vs. poorly.

D) Consequent educational ideas about how to understand and improve assessment processes with both lower costs and higher benefits.

E) Further educational applications. E.g. if software were good enough and cheap enough, we could run exercises with students as the markers to:

  1. Train them to make the same judgements as experts, where appropriate and important.
  2. Show them how judges do and don't differ from each other.
  3. Possibly, show students how their judgements are much better (though less confident) than they think. ....
This is an example of how learning often involves becoming expert, which in turn involves not just learning concepts and facts and how to reason about them, but also learning new perceptual skills and new execution skills (fast sentence composition, ...).

F) We now know quite a lot about the software requirements that wasn't known before:

  1. The modularisation / architecture: e.g. separation of object presentation; choice of which pair to look at; and the other aspects of the UI design (see the sketch after this list).
  2. Flexibility in how the objects are displayed, and whether to display them at all.
  3. Variations in the algorithm: how to deal with worst-case orderings of judgements, e.g. a rogue marker, an unusual object, or one wrong judgement early on.
  4. The need for multiple outputs from the marker, at least in some cases; and how to make a much more efficient UI for this.
  5. How to handle multiple inputs from one learner product as in Plimmer's zip files.

A paper on requirements might really be a contribution ....

Software architecture

With hindsight, I would now (March 2019) recommend that software for APR be organised into 4 well-separated major modules.

Niall Barr's software implementing APR / ACJ

[Should this be moved somewhere else?]
Link to my notes on (using) Niall Barr's software implementing APR / ACJ

Category, ordinal, interval, ratio-scale

Associated people

Edinburgh Glasgow
