Web site logical path: [www.psy.gla.ac.uk] [~steve] [grumps] [this page]
We might look at a strategic alliance with other researchers who are relatively expert in mining: e.g. get them to analyse our data, while we adapt our collection mechanisms to suit them better.
Phil and I will mull over this. We can consider progressing this a bit more in a month or two.
He creates CAL for medical students, and in some ways seems to have Malcolm's attitude to this: collecting huge amounts of data, mining it, and presuming this must yield gold any time now.
In his talk, I remember:
MIT has put all its "materials" on line. His ambition is to build what he calls a "learning management system" (I think he already has a version running) that adds what the materials alone don't give you: personal records, matching curriculum demands to materials, and generally keeping track of every little thing students do. I think he's nuts, but you probably don't. Also interesting: he automated a substantial test for some part of the medical course (anatomy?), having already built 3D visualisation for anatomy to make it more learnable. He gave a rushed but interesting account of how students started to cheat on this test by collaborating, and of how he used the detailed records to show when this happened. Webcams, added after the cheating had got established, turned out to be important in showing that the fastest time wasn't due to cheating after all but was achieved by the best student.
So: we probably want to see and hear more of this guy; and discuss educ-related data mining with him.
The subject will be: all the data we gathered in the first study. The aim:
We'll form a subgroup within Grumps to pursue this. First meeting on Friday in Julie's office for me and Murray, at least, to start getting familiar with the data and imagining what we might want to ask. Next meeting in about 10 days, invite Peter Hay to come and advise/lead us doing analyses and/or using the mining tool Julie has now installed.
My personal view is that Phil's first pass at this and his diagram is hopeless because it mixes several quite different types of thing: in my view there are several interacting, but logically independent, dimensions or aspects:
Note too that we need one set of terms to describe software configurations in Grumps (particular designed deployments of data collection units), and another set to describe attempts to analyse the data, because these two activities can be separate, as they are in this, our first little example activity, and not united by a single goal or prior question.
But the real insight is the quite close analogy to "information need" as used in the IR field to distinguish, in Mizzaro's terms, between real, perceived, expressed, and formal information needs:
Presumably, we should really be able to characterise any particular investigation on this top-down/bottom-up (TD/BU) dimension, and be able to say how to use Grumps for each case and for those in between.
What we have is data collected mainly because it was there, and so collected in terms of the software on which it is parasitic: e.g. keystrokes, or "window events" which turn out to be not what users see as windows but artefacts of the Microsoft software architecture. The first job is to re-attach human meaning to it wherever possible.
Examples: UAR timestamps are missing their most significant parts, so can we recover these by inference from other logs? Some recorded times are only relative to the start of a session: can we calculate absolute times by consulting the separate log of concentrator startups/sessions? On a broader scale, we may have login IDs, but these are only valid for a year, so the table of users must be captured by us within this time limit. IDs for tutors and students overlap (they re-use the same ID space), so we must save both tables and also record whether a given record relates to a tutor or to a student (perhaps from the machine ID: lab PC vs. handheld for tutor).
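To make the two recovery steps concrete, here is a minimal sketch in Python. All field names, the session-log format, and the lab-PC/handheld naming convention are invented for illustration; the real Grumps tables will differ. It shows the idea only: absolute times recovered by adding a session-relative offset to the concentrator's recorded session start, and the overlapping tutor/student ID spaces disambiguated by machine type.

```python
from datetime import datetime, timedelta

# Hypothetical concentrator startup log: session id -> absolute start time.
session_starts = {
    "s1": datetime(2002, 3, 4, 9, 0, 0),
}

# Hypothetical UAR records carrying only session-relative times (seconds).
uar_records = [
    {"session": "s1", "rel_secs": 12.5, "machine": "lab-pc-07"},
    {"session": "s1", "rel_secs": 90.0, "machine": "handheld-02"},
]

def restore_absolute_time(record):
    """Recover an absolute timestamp by adding the session-relative
    offset to the session's start time from the concentrator log."""
    start = session_starts[record["session"]]
    return start + timedelta(seconds=record["rel_secs"])

def classify_role(record):
    """Disambiguate the overlapping tutor/student ID spaces using the
    machine an event came from: here, handhelds are assumed to be
    tutors' machines and lab PCs students' machines."""
    return "tutor" if record["machine"].startswith("handheld") else "student"

for r in uar_records:
    r["abs_time"] = restore_absolute_time(r)
    r["role"] = classify_role(r)
```

The point of the sketch is that both repairs depend only on context we capture in time (the session log, the machine inventory), not on any particular Question we later ask of the data.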
I want to argue that most of this job is about recovering meaning independently of the particular questions and information needs. It's a separate job or stage in our process. It is informed not by what we think we want to know, but by what is going on in the situation that generates the data, i.e. by the human understanding of that situation (its context), which the software lacks but the people have. And this meaning can and should be reconstructed first, early on, partly in order to allow BU and data-mining spotting of new patterns.
In general, I believe this amounts to writing an ER diagram of the (human-informed) situation in which the data is generated, relating the data actually collected to that diagram, and then as far as possible arranging for extra data to be acquired in order to relate the data to the entities that are in the diagram and are humanly meaningful. This amounts to capturing enough from the context to restore meaning; and using analysis such as ER diagrams as part of the method for doing this. Thus we do analysis after the design of the data collection, while classically it is done in advance; although we may then have to change our collection or alternatively do mass data restructuring before any real analysis guided by our information need/questions can be done.
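The relationship between the ER diagram of the human-understood situation and the data actually collected can be sketched as a toy schema. Every entity and attribute name below is illustrative, not the actual Grumps schema; the point is only the shape: raw events relate to meaningful entities through foreign keys, and "restoring meaning" means arranging enough extra capture that those joins can actually be made.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class User:                 # a humanly meaningful entity
    user_id: str
    role: str               # "student" or "tutor": must be captured while IDs are valid

@dataclass
class Machine:
    machine_id: str
    kind: str               # e.g. "lab-pc" or "handheld"

@dataclass
class Session:
    session_id: str
    machine_id: str         # relates the session to a Machine
    user_id: Optional[str]  # may be missing in the raw capture

@dataclass
class RawEvent:             # the data actually collected, parasitic on the software
    session_id: str         # the foreign key back to the meaningful entities
    rel_secs: float
    payload: str            # e.g. a keystroke or a Microsoft "window event"

# Restoring meaning = ensuring every RawEvent can be joined, via Session
# and Machine, to a User that the humans in the situation care about.
session = Session("s1", "lab-pc-07", "u1")
event = RawEvent("s1", 3.0, "keydown:a")
```

Doing this analysis after designing the collection, rather than before, is exactly the inversion of classical practice that the paragraph above describes.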
I want to argue that recovering meaning for Grumps data (i.e. converting data into information) is independent of the information need/Question. The retort to this is: well, we think perhaps the faculty the student is in might be predictive of failure (within our Question), so we redo our data capture to record or retrieve that. But conceiving that, and executing it, relies on what all the humans involved know about this domain, and the software does not, independently of the particular Question. So perhaps what we want is to view ER elicitation as part of this process step: retrieving meaning not only from the context but from the human heads. This analysis, classically done as design elicitation, will have to be part of the Grumps process: eliciting domain knowledge as part of setting up an "experiment", i.e. an investigation deploying Grumps collections.