How can we measure the effectiveness of an AI tutor?

Measuring human-learner performance

We can of course try to directly measure the learner’s performance, e.g. with explicit tests and exams - but these are effortful for the learner, inefficient, and fail to capture everything that matters.

We can measure the learner’s engagement and enjoyment, either explicitly (with surveys) or implicitly (by their continued interest). This makes some sense, in that a learner won't learn much if they give up. But on the flip side, it's possible to be engaged in something that isn't efficacious. So, engagement is necessary but not sufficient for learning.

We might even try and quantify aspects of the learner’s behaviour with automated LLM evals, e.g. whether the learner is exhibiting curiosity, or asking good questions.
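As a minimal sketch of what such an eval could look like: the function below sends the learner’s side of a transcript, plus a grading rubric, to an LLM judge. The rubric wording and the `complete` callable (a stand-in for whichever LLM API you use) are illustrative assumptions, not a reference to any particular provider or eval suite.

```python
import json

# Illustrative rubric - the dimensions and wording are assumptions,
# not drawn from any particular eval suite.
CURIOSITY_RUBRIC = """\
You are grading the learner's side of a tutoring transcript.
Reply with JSON only: {"curiosity": 1-5, "question_quality": 1-5, "rationale": "..."}
- curiosity: does the learner probe beyond what's strictly required?
- question_quality: are their questions specific and well-formed?
"""

def score_learner(transcript: str, complete) -> dict:
    """One LLM-judge pass over a transcript.

    `complete` is any callable that takes a prompt string and returns
    the model's text response - a stand-in for your provider's API.
    """
    return json.loads(complete(CURIOSITY_RUBRIC + "\nTranscript:\n" + transcript))
```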

Of course, even if we could measure all these dimensions at a given moment, it’s not easy to tell how much of a learner's change over time to attribute to the teacher. We have expensive gold-standard measurements (e.g. longitudinal between-subject AB tests/RCTs), but we can't run these very often. So in practice, we have to rely mostly on cheap & immediate automated evaluations, after validating them occasionally against the gold standard - kinda like applying the cosmic distance ladder approach to product development, where we try and find a chain of proximal measures that ladder up to our expensive, gold-standard distal measure.
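Concretely, the occasional validation step might look like the sketch below: whenever we have a cohort with both cheap proxy scores and gold-standard learning gains (e.g. from an RCT), we check how well one tracks the other before trusting the proxy for day-to-day decisions. The function and variable names are hypothetical.

```python
import numpy as np

def validate_proxy(proxy_scores, gold_gains):
    """Correlate a cheap automated metric with the gold standard.

    proxy_scores: per-learner automated eval scores (cheap, frequent)
    gold_gains:   learning gains for the same learners, measured in an
                  occasional RCT / longitudinal study
    """
    r = np.corrcoef(proxy_scores, gold_gains)[0, 1]
    return r  # only lean on the proxy day-to-day if r stays high
```

Following the ladder metaphor, we’d presumably want to re-run this validation whenever the product changes materially, since a proxy calibrated against one version of the tutor may drift.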

Measuring teacher performance

We can also try and measure the teacher’s behaviour, as in the LearnLM paper, where the model is tuned towards important pedagogical dimensions, like “be encouraging”, “don’t give away the answer prematurely”, and “keep the learner on track”.
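The same LLM-judge machinery from the earlier sketch could score the teacher’s side of the transcript against these dimensions. The rubric phrasings below are paraphrases for illustration, not LearnLM’s own wording:

```python
import json

# Paraphrased pedagogical dimensions - illustrative, not LearnLM's exact rubrics.
TEACHER_RUBRICS = {
    "encouraging": "Does the tutor acknowledge effort and keep a warm tone?",
    "withholds_answer": "Does the tutor guide with hints rather than revealing the answer prematurely?",
    "keeps_on_track": "Does the tutor steer digressions back to the learning goal?",
}

def score_tutor(transcript: str, complete) -> dict:
    """Score one transcript against each rubric on a 1-5 scale."""
    scores = {}
    for name, rubric in TEACHER_RUBRICS.items():
        prompt = (
            f"Rubric: {rubric}\n\nTranscript:\n{transcript}\n\n"
            'Reply with JSON only: {"score": 1-5, "evidence": "short quote"}'
        )
        scores[name] = json.loads(complete(prompt))
    return scores
```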

Beyond these, there are many lower-level pedagogical best practices we might consider, e.g. encouraging an incremental mindset, making the content memorable, applying a spiral curriculum, using analogies and examples, etc.

And of course there are various product metrics, e.g. latency, ease of use.

AI offers new, more accurate, more humane ways to evaluate

We can also consider more speculative approaches that are too labour-intensive for human teachers:


