How can we measure the effectiveness of an AI tutor?
Measuring human-learner performance
We can of course try to directly measure the learner’s performance, e.g. with explicit tests and exams - but these are effortful for the learner, inefficient, and don’t capture everything that matters.
We can measure the learner’s engagement and enjoyment, either explicitly (with surveys) or implicitly (e.g. by whether they keep coming back). This makes some sense, in that a learner won't learn much if they give up. But on the flip side, it's possible to be engaged in something that isn't efficacious. So, engagement is necessary but not sufficient for learning.
We might even try and quantify aspects of the learner’s behaviour with automated LLM evals, e.g. whether the learner is exhibiting curiosity, or asking good questions.
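To make this concrete, here is a minimal sketch of what one such automated eval might look like. Everything in it is illustrative: `call_llm` is a stand-in for whatever model-calling function is available, and the rubric wording and 1-5 scale are assumptions, not a standard.

```python
import json

# Illustrative sketch of an automated LLM eval over a single learner message.
# `call_llm` is a placeholder for your own model call; the rubric and scale
# are assumptions made for this example.

RUBRIC = (
    "You are grading a learner's message from a tutoring dialogue.\n"
    "Rate, from 1 (low) to 5 (high):\n"
    "- curiosity: does the learner probe beyond what was asked of them?\n"
    "- question_quality: is the question specific, and does it build on prior material?\n"
    'Return JSON of the form {"curiosity": <int>, "question_quality": <int>}.\n'
)

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your own model call here")

def score_learner_message(message: str) -> dict:
    prompt = RUBRIC + "\nLearner message:\n" + message
    # In practice you would validate the JSON and retry on malformed output.
    return json.loads(call_llm(prompt))
```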
Of course, even if we could measure all these dimensions at a given moment, it’s not easy to tell how much of any change over time we should attribute to the teacher. We have expensive gold-standard measurements (e.g. longitudinal, between-subject A/B tests/RCTs), but we can't run these very often. So in practice, we have to rely mostly on cheap & immediate automated evaluations, after validating them occasionally against the gold standard - kinda like applying the cosmic distance ladder approach to product development, where we try and find a chain of proximal measures that ladder up to our expensive, gold-standard distal measure.
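As a sketch of that validation step: suppose that for one RCT cohort we have both the cheap automated proxy score per learner and the gold-standard learning gain (post-test minus pre-test). Checking how well the two track each other tells us whether the proxy deserves a rung on the ladder. The numbers below are purely illustrative.

```python
import numpy as np

def proxy_validity(proxy_scores: np.ndarray, learning_gains: np.ndarray) -> float:
    """Correlation between a cheap automated proxy and the gold-standard gain.

    proxy_scores: one automated-eval score per learner (cheap, run constantly).
    learning_gains: post-test minus pre-test for the same learners, from an
        occasional RCT cohort (expensive, run rarely).
    """
    return float(np.corrcoef(proxy_scores, learning_gains)[0, 1])

# Toy, illustrative numbers only: five learners from a hypothetical cohort.
print(proxy_validity(np.array([0.2, 0.4, 0.5, 0.7, 0.9]),
                     np.array([0.5, 1.5, 1.0, 2.5, 3.0])))
```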
Measuring teacher performance
We can also try and measure the teacher’s behaviour, as in the LearnLM paper, where the model is tuned towards important dimensions, like “be encouraging”, “don’t give away the answer prematurely”, and “keep the learner on track”.
Beyond these, there are many lower-level pedagogical best practices we might consider, e.g. encouraging an incremental mindset, making the content memorable, applying a spiral curriculum, using analogies and examples, etc.
And of course there are various product metrics, e.g. latency, ease of use.
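To make the rubric idea concrete, here is a sketch of an LLM judge scoring the tutor’s replies against the dimensions above and averaging per dimension across a transcript. As before, `call_llm` is a stand-in for whatever model call is available, and the prompt wording and 1-5 scale are assumptions.

```python
import json
from statistics import mean

# Illustrative sketch: score each tutor reply against a pedagogy rubric and
# average per dimension across a transcript. `call_llm` is a placeholder.

DIMENSIONS = ["be encouraging", "don't give away the answer prematurely",
              "keep the learner on track"]

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your own model call here")

def score_tutor_turn(context: str, tutor_reply: str) -> dict:
    prompt = (
        "Rate the tutor's reply on each dimension from 1 (poor) to 5 (excellent).\n"
        "Dimensions: " + "; ".join(DIMENSIONS) + "\n"
        "Return JSON mapping each dimension name to an integer score.\n\n"
        f"Dialogue so far:\n{context}\n\nTutor reply:\n{tutor_reply}\n"
    )
    return json.loads(call_llm(prompt))

def score_transcript(turns: list[tuple[str, str]]) -> dict:
    # turns: (dialogue_so_far, tutor_reply) pairs; returns per-dimension means.
    scores = [score_tutor_turn(ctx, reply) for ctx, reply in turns]
    return {d: mean(s[d] for s in scores) for d in DIMENSIONS}
```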
AI offers new, more accurate, more humane ways to evaluate
We can also consider more speculative approaches that are too labour-intensive for human teachers:
- A deep dialogue with a teacher (almost like a low-stakes, ongoing PhD viva) provides a very rich measure of a learner’s handle on the material - the teacher can constantly probe, ask questions at the boundaries of the learner’s knowledge, and ask them to apply what they have learned in unexpected ways. Asking questions at the margins of the learner’s knowledge like this maximises the informational payoff of each question to the teacher (see the sketch after this list). It’s too expensive to have a human teacher in constant discussion with the learner, but this may be one of the ways that an AI teacher could develop a very rich sense of a learner’s ability over time.
- “If you can't explain it simply, you don't understand it well enough” (Feynman). We can discover a great deal about the learner’s understanding by asking them to teach someone else (either a human or AI peer), and noticing where in the material they succeed and where they struggle as a teacher. And as a nice side-effect, both these activities will also be very effective for helping the learner to learn.
- If we have built a rich, explicit model of the learner, as above, then we could use the simulated learner’s performance as a proxy measure for the real learner’s ability - perhaps in the future, our grade will be based on how well our avatars do on a multi-day battery of simulated exams!
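As promised above, here is a minimal sketch of “asking questions at the margins of the learner’s knowledge”. It assumes a deliberately toy model: a single latent ability number, a difficulty per question, and a logistic link between them, so the most informative question is the one whose predicted success probability is closest to 0.5 (where the outcome’s entropy is highest). Everything here - the model, the question bank, the update rule - is illustrative, not a real implementation.

```python
import math
from dataclasses import dataclass

# Toy IRT-style sketch of probing at the boundary of the learner's knowledge:
# the probability of a correct answer is sigmoid(ability - difficulty), and the
# teacher asks the question whose outcome is hardest to predict.

@dataclass
class Question:
    text: str
    difficulty: float  # on the same scale as ability

def p_correct(ability: float, difficulty: float) -> float:
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def bernoulli_entropy(p: float) -> float:
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def pick_most_informative(ability: float, questions: list[Question]) -> Question:
    # The question with the highest outcome entropy sits closest to the margin
    # of what the learner can currently do.
    return max(questions, key=lambda q: bernoulli_entropy(p_correct(ability, q.difficulty)))

def update_ability(ability: float, q: Question, correct: bool, lr: float = 0.5) -> float:
    # Crude gradient-style update: nudge the estimate towards what the observed
    # answer suggests.
    return ability + lr * ((1.0 if correct else 0.0) - p_correct(ability, q.difficulty))

# Example: one step of the probing loop, with a made-up question bank.
bank = [Question("define the term", -1.0),
        Question("apply it to a new case", 0.5),
        Question("critique a flawed proof", 2.0)]
ability = 0.0
q = pick_most_informative(ability, bank)   # -> "apply it to a new case"
ability = update_ability(ability, q, correct=True)
```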