Last changed 6 Oct 2021 ............... Length about 5,000 words (41,000 bytes).
(Document started on 18 Mar 2016.) This is a WWW document maintained by Steve Draper, installed at http://www.psy.gla.ac.uk/~steve/apr/apr.html. You may copy it. How to refer to it.


Assessment by pairwise ranking (a.k.a. "ACJ")

Brief contents list


Assessment by pairwise ranking (APR), also referred to by various other terms, e.g. ACJ, has lately emerged from the school sector as a radical departure in assessment, only recently made feasible by technology.

In APR, instead of reading each student script once and deciding its mark, markers see pairs of scripts and decide which should rank above the other on a single complex criterion. Software assembles these pairwise judgements into an ordering (an ordinal scale). However, by applying Thurstone's "law of comparative judgement" the software further calculates a quantitative interval scale -- allowing it to calculate which further comparisons will yield the most (or least) additional information, and so allowing optimisations that, for large numbers (20 scripts is a very rough threshold), reduce the total marking work. Finally, if used for assessment rather than only for ranking, grade boundaries are superimposed on the rank order.
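
To make the scale construction concrete, here is a minimal sketch (mine, not Pollitt's engine; all names and numbers are invented for illustration) of Thurstone's Case V simplification, in which each script's scale value is the average probit of its win proportions against the others. It assumes Python with numpy and scipy available.

    # A minimal sketch, assuming Thurstone Case V (equal discriminal dispersions),
    # so each scale value is the mean probit of the script's win proportions.
    import numpy as np
    from scipy.stats import norm

    def thurstone_case_v(wins):
        """wins[i, j] = number of times script i was judged better than script j."""
        totals = wins + wins.T
        with np.errstate(divide="ignore", invalid="ignore"):
            p = np.where(totals > 0, wins / totals, 0.5)   # win proportion; 0.5 if never compared
        p = np.clip(p, 0.01, 0.99)                         # keep the probit finite
        z = norm.ppf(p)                                    # probit of each win proportion
        np.fill_diagonal(z, 0.0)
        scale = z.mean(axis=1)                             # Case V: average over all opponents
        return scale - scale.mean()                        # centre the scale: only differences matter

    # Invented example: 4 scripts, a handful of judgements each.
    wins = np.array([[0, 3, 2, 3],
                     [1, 0, 2, 2],
                     [1, 1, 0, 2],
                     [0, 1, 1, 0]])
    print(thurstone_case_v(wins))   # higher value = higher on the interval scale

The adaptive, work-saving part is then a matter of choosing which comparison to ask for next (see the software architecture section in Part 2).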

Controlled studies using professional markers employed by school exam boards have shown that marking in this way gives much higher (NOT lower) reliability, and that for large numbers of scripts, the total time taken is less. The statistics can directly identify when sufficient consensus has been reached; which scripts generate most disagreement (send them for a second and third opinion); and which markers agree least with other markers. Originally designed for higher reliability (repeatability, and so fairness) of marks and reduced costs, it can also collect feedback comments.
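
As a hedged illustration of that last point (not the exam boards' actual statistic), a judge who keeps disagreeing with the fitted scale can be flagged by averaging how "surprised" the model is by each of their judgements. The function and data below are invented, and assume scale values have already been estimated (e.g. as in the sketch above) under a logistic response model.

    # Invented sketch: mean squared residual per judge against a logistic model.
    import math

    def judge_misfit(judgements, scale):
        """judgements: list of (judge, winner, loser) triples of script IDs.
        scale: dict of script ID -> fitted scale value.
        Returns mean squared residual per judge; larger = more out of line."""
        sums, counts = {}, {}
        for judge, winner, loser in judgements:
            # The model's probability for the outcome actually observed.
            p = 1.0 / (1.0 + math.exp(scale[loser] - scale[winner]))
            sums[judge] = sums.get(judge, 0.0) + (1.0 - p) ** 2
            counts[judge] = counts.get(judge, 0) + 1
        return {j: sums[j] / counts[j] for j in sums}

    # Judge "C" keeps preferring the script the scale ranks lower, so C is flagged.
    scale = {"s1": 1.0, "s2": 0.0, "s3": -1.0}
    judgements = [("A", "s1", "s2"), ("A", "s2", "s3"), ("B", "s1", "s3"),
                  ("C", "s3", "s1"), ("C", "s2", "s1")]
    print(judge_misfit(judgements, scale))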

The most interesting underlying issue is that APR stands in complete contrast to the currently dominant, default approach of breaking a judgement down into separate judgements against multiple explicit criteria, which at least has the virtue of supporting usefully explicit and diagnostic feedback to learners. Instead, APR uses a single complex criterion for marking. However, in many ways real academic values have a large implicit aspect; and furthermore they are holistic rather than always and simply reductionist. APR is particularly appropriate for portfolios of work, and for work where different students may choose to submit in different media (e.g. printed documents, web pages, audio tapes).

Further aspects

It should also be noted that APR seems to be, just as Thurstone argued, more natural psychologically. Thus it may be worth adopting as a method even without any computational support, or when the software simply presents the cases to be compared without optimising how many comparisons are done.

Certainly some academics who have read about it now do paper marking in this way: using pairwise comparisons to sort the whole set into rank order, and then deciding where to put the grade boundaries.
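
For anyone trying this on paper, the process is literally a comparison sort driven by a human comparator. A toy sketch (hypothetical script names, Python standard library only) could be as small as this:

    # Toy sketch: sort scripts into rank order by asking the marker one pair at a time.
    from functools import cmp_to_key

    def ask_marker(script_a, script_b):
        """Return -1 if the marker judges a better than b, else 1."""
        answer = input(f"Which is better, {script_a} or {script_b}? [a/b] ").strip().lower()
        return -1 if answer == "a" else 1

    scripts = ["essay_01", "essay_02", "essay_03", "essay_04", "essay_05"]
    ranked = sorted(scripts, key=cmp_to_key(ask_marker))   # best first: "better" sorts earlier
    print("Rank order, best first:", ranked)
    # Grade boundaries are then decided by eye on this rank order, as described above.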

Gonsalvez et al. (2013), in another context, report a method for supervisors to give summative ratings of competence to students on professional field placements, that use a standard set of (4) vignettes (for each of 9 domains). The supervisor selects the vignette that is closest to the student's performance on this aspect.
Their argument and evidence is that this leads to more consistency across raters, and to less of the kind of bias where supervisors seem prone to rate a student overall and then attribute that mark to the student across all domains. The implication is that a disadvantage of numerical scores is that they lead psychologically to LESS precision and discrimination than un-numbered vignettes do.

The Thurstone foundation for APR also seems to have a close link to David Nicol's recent studies of what goes on when students perform (reciprocal) peer critiquing. He finds that when a student generates a judgement or critique of another student's work, having themselves just performed the same task (e.g. their own version of a piece of coursework), they absolutely cannot prevent themselves making an inner comparison of their own work with the other student's (a paired comparison); and that they generate a lot of useful thoughts about their own work from that, even when neither asked nor required to do so (Nicol 2018).

Links to Niall's software

https://learn.gla.ac.uk/niall/ltiacj/index.php

Related software *

USPs: My list of the 13 distinct features that make APR / ACJ important

Separate intellectual ideas

References

More references (mostly received from Paul Anderson)

Talks

Names, terminology

Adaptive Comparative Judgement (ACJ) is the term used by Pollitt in the most important publication so far. However it describes the software, not the human process nor the psychology of it. The software does the adaptation, the human judges do not. The humans just do pairwise comparisons or (more exactly) ranking.

The theory, following Thurstone, assumes that there is an implicit, psychologically real, scale in the minds of the humans, which is not directly accessible by them (through introspection or reasoning), but reveals itself as a consistent pattern in their judgements.

Furthermore, the theory assumes that this is true of complex judgements, and that these are not helped by attempting to break them down into multiple component criteria which must then be combined again to reach a mark; almost always by a simplistic arithmetic operation such as adding marks, which generally does not reproduce the academic value judgements actually made by the experts consulted.
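
A made-up toy example (not from any of the sources cited here) of that arithmetic point: adding criterion marks can put a script above one that a holistic judgement would prefer, because addition cannot weight a collapse on a single crucial criterion.

    # Invented marks on three criteria for two scripts.
    criteria_marks = {"A": [7, 7, 7], "B": [10, 10, 2]}
    totals = {script: sum(marks) for script, marks in criteria_marks.items()}
    print(totals)                     # {'A': 21, 'B': 22}: B "wins" by addition
    print(min(criteria_marks["B"]))   # 2: the collapse a holistic judgement might weight heavily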

TR = Thurstone ranking [Grove-Stephenson]
TP = Thurstone pairs [Mhairi]
TS = Thurstone scaling [a web page on psych methods]
LCJ = Law of comparative judgement (Thurstone's "law")
CJ = Comparative Judgement; "Thurstone's method of CJ". [Pollitt]
TPCJ = Thurstone Paired Comparative Judgements; & "Thurstone's method of CJ".
DCJ = Direct Comparative Judgement [Pollitt]
PC = Pairwise (or Paired) Comparison [Bramley]
ROM = Rank Ordering Method [Bramley]
PCM = Pairwise (or Paired) Comparison Methods [Bramley]
*APR = Assessment by Pairwise Ranking
ADCJ = Assessment by Direct Comparative Judgement [Pollitt]
PRS = Pairwise Ranking Scales
PRTS = Pairwise Ranking Thurstone Scales
PCR = Pairwise Comparative Ranking [my term, but is best, avoids abbrev. PC]
PCRS = Pairwise Comparative Ranking Scales [my term]
ACJ = Adaptive Comparative Judgement [common]

Currently preferred terms:
APR for general overall process; or
ACJ for the software-optimised cost-saving version. CJs for individual judgements.

N.B.  comparative & judgement are redundant [not quite because judgement can be absolute]
comparative & ranking are redundant [TRUE]
Strictly, Thurstone scaling produces more than a ranking: an interval scale.

Part 2: Beyond Pollitt's work

A vision. Or a wish, anyway.

I'd like to see a UI (user interface) allowing use of the APR process without displaying the objects on screen (e.g. handwritten exam scripts): just have a unique ID on a post-it note on each script, with the software selecting pairs to compare, receiving the judgements, and receiving the formative comments if any. While many things can be usefully digitised and presented on a computer, this will probably never be true of all the objects we may need to judge.
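
A rough sketch of what that could look like (everything here, from the filename to the IDs, is invented; a real engine would also choose the pairs adaptively rather than at random):

    # Input-only interface: the scripts stay on paper with IDs on post-its; the
    # program only names the pair, records the judgement and an optional comment,
    # and appends everything to a CSV log.
    import csv, random, itertools

    def run_session(script_ids, logfile="judgements.csv", rounds=10):
        pairs = list(itertools.combinations(script_ids, 2))
        random.shuffle(pairs)                 # placeholder for adaptive pair selection
        with open(logfile, "a", newline="") as f:
            log = csv.writer(f)
            for a, b in pairs[:rounds]:
                choice = input(f"Compare scripts {a} and {b}. Which is better? [{a}/{b}] ").strip()
                comment = input("Optional comment (press return to skip): ").strip()
                log.writerow([a, b, choice, comment])

    run_session(["S01", "S02", "S03", "S04", "S05"])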

Here's another thought I'd like to share. Pollitt's work made a deep impression on me, in making me think about the real psychological marking process, and how better to work with it and support it. But I learned another big lesson, which might perhaps be combined with APR, from Beryl Plimmer (refs above):

Plimmer worked in HCI (Human Computer Interaction) and did a real, proper task analysis of what is involved in marking a heap of first-year programming assignments. Firstly: the student submissions were zip files, with many parts (code, output of the test suite, documentation ....). This is another case where current user interfaces (UIs) to ACJ engines just won't cut it. Plimmer's software opened all the zipped files at once in separate windows (and probably had careful window/pane management).

Secondly, she recognised that the marker had to produce multiple DIFFERENT responses, in DIFFERENT documents (and formats), and for DIFFERENT audiences: the mark for the admin people; feedback to the student; markups on the code itself; ..... And she used a single pen input device (with character recognition to turn the writing into digital text) to save the user from constantly switching between mouse and keyboard.

Thirdly, this made me reflect on my own marking (of piles of psychology essays) and on why I never felt software would help: I use a big table (enough for at least four, but preferably six, people to sit at) exactly because there is so much stuff that needs to be visible/writeable at once, and computer screens are so pathetically small compared to a table. (But Plimmer shows that you CAN use a screen, and that special code for opening and managing the windows automatically makes a considerable difference.)

In fact, in my own case I generally have four "outputs" on different bits of paper:

  1. Comments to return to the student
  2. My private comments to myself e.g. to use when discussing my marks with a 2nd marker.
  3. Mark sheet to the Admin staff
  4. A sheet with any ideas and refs the student used that I want to follow up for myself, because sometimes students have things to teach me, and/or trigger off new thoughts in me that, as a subject specialist, I don't want to forget. Exams, especially in final year options, often cause learning in the teacher.

Joe's lashup

Related to Plimmer's insights into improving software for the marking task is the way some individuals have created substantial software improvements for themselves.

Plimmer's paper gives a published description of her carefully designed system for marking year 1 programming assignments in her dept. Joe Maguire's practice concerns a different kind of course and marking, and is implemented quite differently: built out of ready-to-hand software adapted into a well-personalised system for one marker on one course. He has used this personalised, personally created lashup to increase the quality and speed of his marking for 3? years now.

Software

  1. "PDF expert": a not-free pdf Viewer with some additional fns and now on iPad, iPhone, Mac.
    Handwriting, pencil tool for signatures, edit text, "stamps".
  2. iPhone XR (to support pencil and handwriting recog?)
  3. Cloud storage of some kind. Esp. university's own for privacy
  4. Apple sync: recently better at sync-ing his iPhone, iPad, desk Mac.
  5. Handwriting by apple pen (in Pdf docs).
  6. Safari multi-window pdf viewers.

Teacher-level functions supported

  1. Comment bank. ?In a Word doc/PDF? "Stamps" in PDF Expert: you can superpose them on any place in any PDF doc. Essentially a short comment you define, usually one word; each stamp can be resized to taste. This is in effect another form of comment bank. [Text editing in PDF documents "like Word".]
  2. Cloud storage and syncing let him do bits of marking anywhere. Instead of having to carve out large lumps of uninterrupted time, he can work on a task a bit, save it with a few pointers (e.g. underlining) about where he is, then resume. This does not degrade the marking but eases the pressure on time by making more bits of time usable, and by making pause/resume of the task much cheaper.
  3. Multiple outputs from the marker: marking up student docs; a private Word doc to self with private notes (e.g. to be used in discussion); personalised comments to each student (partly drawn from the comment bank).
  4. He has a doc that is a form with the rubric and grade descriptors, duplicates it for each student, and marks up the form per criterion as it applies to that student.
  5. Each year he goes back to the comment bank and reviews what he might do better in the course (to fend off common errors by students). The ample cloud storage makes this easy to afford and to do.

Back to broadening our vision for future APR/ACJ designs

This is not only illuminating but shows that the ACJ user interfaces (UIs) built up to now could with profit be seriously expanded. It has been shown that ACJ can usefully include the collection and use of formative feedback comments. However this doesn't tackle the real, full needs of marking in general. When assessing complex work (portfolios, sculptures, music ...), it is likely that the assessor will want to make extensive notes to self; and in later rounds of comparisons, re-view not the original work but their notes on it.

So: I think there are really several distinct directions in which Pollitt's work has contributed, yet could be taken further.

A) Algorithms that cut down the work of marking by making economies of scale in comparisons.

B) More reliable (accurate, repeatable) marking.

C) Psych. insight into how assessment really works in the human mind; and hence how to be more realistic about it, and more aware of what it does well vs. poorly.

D) Consequent educational ideas about how to understand and improve assessment processes with both lower costs and higher benefits.

E) Further educational applications. E.g. if software was good enough and cheap enough, we could run exercises with students as the markers to:

  1. Train them to make the same judgements as experts, where appropriate and important.
  2. Show them how judges do and don't differ from each other in judging the student's own level.
  3. Possibly, show students how their judgements are much better (though less confident) than they think. ....
This is an example of how learning often involves becoming expert, which in turn involves not just learning concepts and facts, and how to reason about them; but also learning new perceptual skills, and new execution skills (fast sentence composition, ....).

F) We now know quite a lot about the desirable software requirements that wasn't known before:

  1. The modularisation / architecture: e.g. separation of object presentation; choice of which pair to look at; and the other aspects of the UID (user interface design).
  2. Flexibility in display, and whether to display the work at all.
  3. Variations in the algorithm; how to deal with worst case ordering of judgements e.g. a rogue marker; or an unusual object; or one wrong judgement early on.
  4. The need for multiple outputs from the marker, at least in some cases; and how to make a much more efficient UI for this.
  5. How to handle multiple inputs from one learner product as in Plimmer's zip files.

A paper on requirements might be a real contribution – see the next section.

Software architecture

With hindsight, I would now (March 2019) recommend that software for APR be organised into four well-separated major modules.
  1. The display part of the user interface.
    The "scripts" may be displayed on screen, or not. This module must be ready for dealing with a display of nothing, or one document (the canonical case), or many documents. The latter is not an exotic case. Plimmer (2006) dealt with student submissions of 5 related documents, just for level 1 computer programming. Furthermore, they should be displayed in a carefully designed window with several panes of different custom sizes.

    Displaying "nothing" i.e. having an input-only interface would be useful in going round a sculpture park, assessing the exhibits which are much larger than the computer. However also common in such cases is displying a token reminder e.g. a thumbnail, to remind the user which case they are judging / comparing. A common variant of this will be when the user observes each case separately and makes notes on them; then these notes should be displayed during comparison phase, to remind the user. So the display module needs to be able to accept documents from the users (judges) to display here (not just from the authors or students being assessed). An example could be the sculpture park; but also (for example) judging which is the best PhD of the year, where reading/ skimming each PhD would be done in an earlier phase, followed by mulitiple comparisons and judgements.

  2. The user input part of the user interface.
    Another big insight from Plimmer is that the output of marking is OFTEN multiple output documents (marks vs comments in many cases; but also notes for later use in moderation discussions, or private notes on ideas to follow up, stimulated by a good exam answer). AND that using a single input device is a major advantage (rather than switching repeatedly between mouse, keyboard, stylus).
  3. Statistical calculations engine
    1. Calculation of best estimates of ordering and distances. 2. Calculating the orderings of objects, of markers, ....

    * Follow up a ref. to a related literature in psychophysics: the "staircase" procedure there is also about the "smartest" way to zero in on small differences between measurements.
    Lamming (papers in HEA subject area? ....)
    Kingdom and Prins, book "Psychophysics", ch. on "Adaptive methods". GU library:
    http://encore.lib.gla.ac.uk/iii/encore/record/C__Rb2749176
    
  4. Decision-making and optimisation of the combined human-software job
    (about what pairs should be offered for comparison at each point)
    This module chooses which pair to present next for judgement, given what is known at that point in the judging process. This is the key function for reducing the effort per script to be assessed. It needs broader functionality than early software had, to allow differing modes, e.g. to cope with unknown availability of markers, with how to use a new marker coming on board later than others, etc. A minimal sketch of one possible selection rule follows this list.
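
Here is the minimal sketch promised in point 4: one possible selection rule (not Pollitt's published algorithm), which offers next the not-yet-judged pair whose outcome the current scale estimates make least predictable, i.e. whose predicted win probability is closest to 0.5. The IDs and numbers are invented.

    # One possible (invented) pair-selection rule for module 4.
    import math
    from itertools import combinations

    def choose_next_pair(scale, already_judged):
        """scale: dict of script ID -> current estimate.
        already_judged: set of frozensets of IDs already compared."""
        best_pair, best_info = None, -1.0
        for a, b in combinations(scale, 2):
            if frozenset((a, b)) in already_judged:
                continue
            p = 1.0 / (1.0 + math.exp(scale[b] - scale[a]))  # P(a beats b), logistic model
            info = p * (1.0 - p)             # information in one more judgement of this pair
            if info > best_info:
                best_pair, best_info = (a, b), info
        return best_pair

    scale = {"s1": 1.2, "s2": 1.1, "s3": -0.3, "s4": -2.0}
    print(choose_next_pair(scale, {frozenset(("s1", "s2"))}))   # picks ('s2', 's3')

A real module of this kind would presumably also weigh which scripts have had the fewest judgements so far, and which markers are currently available.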

Next jobs or studies to do

A different kind of study would be to observe people hand-ranking small-ish samples of objects. This could examine differences between people, and study the process from the viewpoint of implicit concepts and judgements; and it could detect who has developed good judgement, or judgement of a certain kind, as an assessment of their (implicit) judgement skill. The ACJ software can comment on differences between judges, and so could be part of a system to judge their judgements.

* ...

Misc. notes: people, software, ...

Niall Barr's software implementing APR / ACJ

[Should this be moved somewhere else?]
Link to my notes on (using) Niall Barr's software implementing APR / ACJ

Category, ordinal, interval, ratio-scale

See notes on this topic at this page: http://www.psy.gla.ac.uk/~steve/best/ordinal.html

Associated people

ALT (assoc of ...) (Cardiff) University of South Wales . 7 miles north of Cardiff; Treforest, near Pontypridd. Edinburgh Newcastle Glasgow
