Last changed 31 Oct 2001 ............... Length about 2,500 words (16,000 bytes).
This is a WWW document maintained by Steve Draper, installed at http://www.psy.gla.ac.uk/~steve/grumps/dmining.html.

Web site logical path: [www.psy.gla.ac.uk] [~steve] [grumps] [this page]

Grumps data mining

Here are some notes on ideas towards getting a data mining and interpretation initiative going within Grumps. We need, within Grumps as a whole, to close the circle between collecting data, discovering interesting information, answering questions, and (re)designing data collection dynamically. This is a first small "study" to get started on this side of the overall Grumps subject.

Overall Plan

Our current plan is:

Analyse the data we collected in Grumps study one around May 2001, as a first case for us.
We didn't collect this data with a firm question in mind, but for this investigation our current question (information need) is "What predicts students failing?" (in this class, ....).
3 Dec 2001 is the deadline for a smallish UK conference "EASE". We aim to submit to EASE:
- One or more papers reporting on this analysis
- At least one "workshop": provide a small sample of data, have attendees discuss how they would analyse it, what they think it might mean.
Probably not for EASE, write a paper on the process (not findings), particularly the human and/or HCI process, of analysing the data. We are collecting observations and diaries during this study with a view to this. In fact in my view (as in one of the messages recorded below) the top goal here would be to aim at producing and publishing The Glasgow Method (of data mining investigations). Within that, two distinct subthemes seem important:
- [PDG] The user experience: how hard it is to learn, to use, and to cope with the variety of tools, software, and skills currently needed to do data mining. So for Julie, it's about the different DBMSs, spreadsheets, the big size and overnight batch runs, ...
- [SWD] The variety of sources of data and using these to triangulate to recover meaning as part of the process. Whether the biggest but unexpected stage in the process really is attaching meaning to the data, regardless of the question/info need. So: spreadsheets containing class assessment info, UAR, LSS, some standard tables e.g. password file, IP lists, ...

Message 1. [2 Aug. 2001]

I've been chatting to Phil and here for the email archives is a possible way to go for Grumps. The biggest "idea" is at (3) below.

Prologue

Our current "plan" for progressing the data mining and interpretation aspect is led by Julie, who over the summer will be doing data cleaning, and looking at interpreting the data we have from study 1. Then we have Quintin signed up as a client with a genuine independent interest over the next year in interpreting data on level 1 DCS students from multiple sources. Finally, we agreed to collaborate in putting together a small but high value reading list, to share some baseline knowledge of mining.

The problem

But there are a lot of specialists in data mining; and more than one kind of mining, each with its specialists. Are we sensible to compete with them, given that none of us knows much, and our main resource in Grumps is Murray, who has not declared a decision to change his main research interest to this? Perhaps we need in the medium term (not this month) to develop a strategy about this.

An idea

Perhaps we could decide to major in a different angle on this, with three parts.

Combining multiple data sources and types (most data miners don't do this apparently).
Explicitly exploring, comparing, and contrasting (in each application domain) multiple approaches to data interpretation. Here is a rough 4-way categorisation:
1. Pre-design the data you want to collect for a specific purpose (classical database design; LSS is our example of doing this). I.e. you know in advance the interpretations you want, and create the data just for that.
2. Collect side-effect data that directly corresponds with what you know you want to know e.g. login events.
3. Collect side-effect data whose direct meaning is relatively unproblematic, but whose further utility may not be e.g. Richard's stuff on collecting command usage data, then applying models of learning and expertise.
4. Full-on mining: collect any old data that's available, and search later for i) possible meanings ii) possible utility.
The Grumps contribution would be to explore ALL of these in parallel, and examine pros and cons, etc.
Develop an explicit Method (capital M) for interpretation that uses a loop of looking at the data and getting more information for understanding it by interviewing the "subjects" i.e. humans who created the data as a side-effect. I.e. regard this not just as something we might do during development, but as a permanent part of the process. This would differentiate us from other researchers, and play to our unique strengths not our weakness at data mining. It fits directly with what Quintin will probably be doing over the next year. (He wants eventually to predict dropout-endangered students. I think he'll have to interview many students, as the biggest reasons for dropouts aren't computational but finance and personal circs.)

Smaller features

We can probably get Karen Renaud interested in work closely related to this, perhaps exploring ways to interpret the data we've warehoused.

We might look at a strategic alliance with other researchers who are relatively expert in mining: e.g. get them to analyse our data, while we adapt our collection mechanisms to suit them better.

Phil and I will mull over this. We can consider progressing this a bit more in a month or two.

Message 2. [3 Sept. 2001]

I don't have time to consider this further right now, but some of you should probably take note of a person I met last week at a workshop: Jem Rashbass: http://www.cbcu.cam.ac.uk/cbcu/staff/person.asp?person_ID=2 I introduced myself, and warned him Grumps would probably be in contact soon.

He creates CAL for medical students, and in some ways seems to have Malcolm's attitude to this: collecting huge amounts of data, mining it, and presuming this must yield gold any time now.

In his talk, I remember:
MIT has put all its "materials" online. His ambition is to build what he calls a "learning management system" (I think he already has a version running) that adds in what the materials alone don't: personal records, matching curriculum demands to materials ... and generally keeping track of every little thing students do. I think he's nuts, but you probably don't. Also interesting: he automated a substantial test for some part of the medical course (anatomy?), having already got 3D visualisation for anatomy to make it more learnable. He gave a rushed but interesting account of how students started to cheat on this by collaborating, and how he used the detailed records to show when this happened. The records were backed up by webcams (after the cheating had got established), which turned out to be important in showing that the fastest time wasn't due to cheating after all but was by the best student.

So: we probably want to see and hear more of this guy; and discuss educ-related data mining with him.

Message 3. [26 Sept. 2001]

We will aim to submit a paper by 3 Dec to the EASE conference at Keele, and use this as a self-imposed first milestone in getting this aspect going. In fact, we'll aim to submit 2 things: at least one paper, and a "workshop" type non-compliant proposal for a session where we bring along a small bit of data and lead/provoke a discussion on the different ways it could be interpreted. (David Budgen was taken with this suggestion I made to him; I've seen it done successfully a couple of times in other kinds of conference. A pretty small bit of data is more than enough; people get actively involved; and the presenters get lots of alternative views and suggestions about approaches.)

The subject will be: all the data we gathered in the first study. The aim:

Try out various alternative analyses, and develop a discussion of the pros and cons (and interactions) of each. I'd provisionally suggest doing analyses at about 4 levels, which vary in how directly they are meaningful to our obvious a priori concerns. E.g.
1. Login events: directly interesting and relevant; interpretation has hardly any problems
2. Application switches: almost unproblematic, except you can't tell whether elapsed time represents someone working or talking on the phone.
3. Command usage analyses (cf. Richard's work)
4. Mining the low-level events
Study the method, and HCI issues of doing such analyses. This is Phil's suggestion: record what we do and how it works out; what is the user's (analyst's) experience in this domain; how tools do and don't support our efforts.
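
The caveat on level 2 above (application switches) can be made concrete with a small sketch. It assumes, purely for illustration, that switch events arrive as (timestamp in seconds, application name) pairs sorted by time; the event format, the function name `time_per_application`, and the 300-second threshold are all hypothetical and are not taken from the Grumps data. It totals elapsed time per application and flags long intervals, where the log alone cannot say whether the person was working or away.

```python
def time_per_application(switches, end_time, idle_threshold=300):
    """Total elapsed seconds per application, plus a list of suspect intervals.

    switches: list of (timestamp_seconds, app_name), sorted by timestamp.
    end_time: timestamp at which the last interval is taken to end.
    idle_threshold: hypothetical cut-off; longer intervals are flagged
    because elapsed time alone cannot show the person was working.
    """
    totals = {}
    ambiguous = []
    # Pair each switch with the next one; the last pairs with end_time.
    following = switches[1:] + [(end_time, None)]
    for (start, app), (stop, _next_app) in zip(switches, following):
        elapsed = stop - start
        totals[app] = totals.get(app, 0) + elapsed
        if elapsed > idle_threshold:
            ambiguous.append((app, start, elapsed))
    return totals, ambiguous
```

Flagging rather than discarding the long intervals keeps the decision with the analyst, who may be able to resolve them from another data source or by asking the person concerned.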

We'll form a subgroup within Grumps to pursue this. First meeting on Friday in Julie's office for me and Murray, at least, to start getting familiar with the data and imagining what we might want to ask. Next meeting in about 10 days, invite Peter Hay to come and advise/lead us doing analyses and/or using the mining tool Julie has now installed.

Dimensions of data investigation

Triggered by Phil's starting notion and diagram, some important ideas emerged in the meeting of 18 Oct. These follow from trying to understand or frame the process we are muddling into.

My personal view is that Phil's first pass at this, and his diagram, are hopeless because they mix several quite different types of thing: in my view there are several interacting, but logically independent, dimensions or aspects:

What is the activity and what to call it?
Experiment / investigation / study / information need.
It isn't an experiment (as we called it in Grumps) because this data's collection in particular was NOT designed to answer a pre-conceived question or hypothesis. We used "study" already. We might call it an investigation.
Note too that we need one set of terms to describe software configurations in Grumps (particular designed deployments of data collection units), and another set of terms to describe attempts to analyse the data, because these two activities can be (and are in this, our first little example activity) separate and not united by a single goal or prior question.
But the real insight is the quite close analogy to "information need" as used in the IR field to distinguish, in Mizzaro's terms, between real, perceived, expressed, and formal information needs:
- The information the person needs from an external viewpoint, but may not be able to describe even to themselves (partly because need depends not just on the person but also on the structure of the world which they may not yet appreciate). I.e. their actual need as understood by God, but perhaps not by them.
- The need they can perceive and perhaps recognise, but not necessarily articulate. I.e. their current implicit understanding of their own need.
- The need they can express in natural language. I.e. their current explicit understanding of their own need.
- The query they are able to communicate to the software.
So:
- We want two sets of terms here: for data collection, and for data analysis activities.
- Pay attention, within the data analysis side, to the 4-way IR distinction between real, perceived, expressed, and formal information needs.
The people, the actions, the roles.
- Main group of information-need Inquirers; perhaps defined by their question i.e. information need (e.g., for us here and now, "Why do students drop out of or fail the level 1 CompSci course?").
- Consultants, who have related information needs now or in the past, and who will therefore have stimulating related questions and answers (for us e.g. Bill Patrick, Alison Mitchell, Richard Thomas).
- Intermediaries (this is IR/ library terminology): technical experts who can help you formulate your search, operate the resources, tell you where to look and what to ask, run your searches for you. (For us: possibly Peter Hay?)
Top down / bottom up (TD/BU).
Whether an investigation is organised from the question to the data collection; or vice versa. In real life, that is in terms of the real historical chronology of human actions, there is probably always a mixture. But in logic, and so in the structure of the arguments published and in the explicit plans people try to organise their activities with, there is a big difference. Classical database work is at a TD extreme: the structure of the data is designed before any is collected, and the analysis methods are all to do with getting that design right; and the technology is famously poor at dealing with the unexpected questions, needs, and cases that crop up after design time. Data mining is at the opposite BU end: you have this data collected for quite other purposes: now what if anything can you infer or extract from it? Grumps is about 3/4 of the way towards BU, but addresses being dynamic (unlike both the others): focus first on instrumenting something, more than on having a prior question, but then organise to be able to change the collection as easily as possible, presumably under the impact of changing (understandings of) questions you would like the data to answer.
Presumably, we should really be able to characterise any particular investigation on this TD/BU dimension, and be able to say how to use Grumps for each case and for those in between.
Meaning
I want to suggest, as a major lesson for me already from this study, that a very important issue, aspect, dimension here is that of restoring meaning to the data collected. This is a first and essential step before any other analysis can be done. There is a classic distinction between data / information / knowledge (/ wisdom). And there is a notion of doing data cleaning before analysis. But I want to add enormously more emphasis here.
What we have is data collected mainly because it was there: and so collected in terms of the software on which it is parasitic e.g. keystrokes, "window events" which turn out not to be what users see as windows but part of the Microsoft software architecture. The first job is to re-attach human meaning to it wherever possible.
Examples: UAR timestamps are missing, or rather their most significant parts are, so can we recover these by inference from other logs? Some recorded times are only relative to the start of a session: can we calculate absolute times by consulting the different log of concentrator startups/sessions? On a broader scale, we may have login IDs but these are only valid for a year: the table of users must be captured by us within this time limit. IDs for tutors and students overlap (re-use the same ID space) so we must save both tables and also get a record of whether a record relates to a tutor or to a student (from machine ID?: lab PC vs. handheld for tutor).
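
As a rough illustration of two of these examples, here is a minimal sketch of (a) turning session-relative times into absolute times by looking up a separate log of session starts, and (b) qualifying an ID by year and role so that tutor and student IDs from the same ID space do not collide. Every name and value in it is an assumption made for the sketch: the table layouts, the session and machine identifiers, and the sample dates are invented and do not describe the actual Grumps logs.

```python
from datetime import datetime, timedelta

# Hypothetical extract from the concentrator startup log:
# session id -> absolute start time (sample values are invented).
SESSION_STARTS = {
    "s1": datetime(2001, 5, 14, 9, 0, 0),
    "s2": datetime(2001, 5, 14, 14, 30, 0),
}

# Hypothetical machine table: lab PCs imply students, handhelds imply tutors.
MACHINE_KIND = {"lab-pc-07": "student", "handheld-02": "tutor"}


def absolute_time(session_id, offset_seconds, session_starts=SESSION_STARTS):
    """Convert a session-relative offset into an absolute time.

    Returns None when the session start was never captured, so the gap
    stays visible instead of being silently guessed.
    """
    start = session_starts.get(session_id)
    if start is None:
        return None
    return start + timedelta(seconds=offset_seconds)


def qualified_user(user_id, machine_id, year, machine_kind=MACHINE_KIND):
    """Build a key that stays unique across years and across roles.

    The role is inferred from the machine, which is itself only a guess
    at how tutor and student records might be told apart.
    """
    role = machine_kind.get(machine_id, "unknown")
    return (year, role, user_id)
```

For instance, `absolute_time("s1", 90)` gives 09:01:30 on the invented sample date, and the same numeric ID seen on a lab PC and on a handheld yields two different keys.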
I want to argue that most of this job is about recovering meaning independent of the particular questions and information needs. It's a separate job or stage in our process. It is often not informed by what we think we want to know; but just about what is going on in the situation that generates the data i.e. the human understanding of that situation (context), that the software doesn't understand but the people do. And this meaning can and should be reconstructed first, early on, partly in order to allow BU and data mining spotting of new patterns.
In general, I believe this amounts to writing an ER diagram of the (human-informed) situation in which the data is generated, relating the data actually collected to that diagram, and then as far as possible arranging for extra data to be acquired in order to relate the data to the entities that are in the diagram and are humanly meaningful. This amounts to capturing enough from the context to restore meaning; and using analysis such as ER diagrams as part of the method for doing this. Thus we do analysis after the design of the data collection, while classically it is done in advance; although we may then have to change our collection or alternatively do mass data restructuring before any real analysis guided by our information need/questions can be done.
I want to argue the independence of recovering meaning for grumps data (i.e. converting data into information) from the information need /Question. But the retort to this is that, well: we think perhaps the faculty the student is in might be predictive of failure (within our Question), so we redo our data capture to record or retrieve that. But conceiving that, and executing it, relies on what all the humans involved know, but the software does not, about this domain independently of the particular Question. So perhaps what we want is to view ER elicitation as part of this process step: that is, not only retrieving meaning from the context but from the human heads. This analysis, classically done as design elicitation, will have to be part of the Grumps process: eliciting domain knowledge as part of setting up an "experiment" i.e. an investigation deploying grumps collections.
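
One way to picture relating the collected data to an ER diagram is the sketch below. It writes two entities as Python dataclasses and lists which of their attributes no collected field supplies, i.e. what extra data would have to be acquired to restore meaning. The entities, attribute names, and the set of collected fields are hypothetical, chosen only to echo the examples in the text (login IDs valid for a year, faculty as a possible predictor); this is not the actual Grumps schema.

```python
from dataclasses import dataclass, fields


@dataclass
class Student:
    login_id: str
    year: int      # login IDs are only valid within a year
    faculty: str   # suggested in the text as possibly predictive of failure


@dataclass
class Session:
    session_id: str
    login_id: str
    machine_id: str


# Hypothetical names of the fields actually present in the collected logs.
COLLECTED = {"login_id", "session_id", "machine_id"}


def missing_attributes(entity, collected=COLLECTED):
    """List the attributes of an entity that no collected field supplies."""
    return [f.name for f in fields(entity) if f.name not in collected]
```

Here `missing_attributes(Student)` returns `["year", "faculty"]`, which is the list of things to capture from other tables or from people, while `missing_attributes(Session)` returns an empty list.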
Time scale.
The time scale over which the analysis needs to be done and re-displayed. In data mining classically this is years; but for detecting students at risk of failing in future it must be a month or two; and for some LSS applications it might be hours or minutes. Also, capturing enough additional contextual data to give meaning usually has a time limit to it.
