AI = Assessment Innovation

Artificial intelligence seems poised to change the game in so many different areas, and psychometrics is no exception. For me, one of the most exciting and immediate applications for AI is the ability to generate test questions. In considering how consequential a shift to AI-based item development might be for assessment, I invite readers of this blog to consider a not-so-far-off reality: what would the world of assessment look like if content development were no longer a bottleneck and item banks were practically limitless?

1. Item development → Item validation

When AIs are able to write high-quality test questions, the role of the SME changes from author to validator. That is, the usual stages of training, authoring, reviewing, revising, and approving turn into just approving. And because little effort goes into creating any individual item, there is no need to invest time in shaping it: when a generated item doesn't meet (objective and/or idiosyncratic) standards, the SME can simply choose another.

2. Requirements for SMEs shrink

Some time ago, I created a model of item development that estimated the average time to bring an item to experimental status (i.e., ready for field testing) at about 2 hours of SME time[1]. That model assumed a rejection rate of about one third. Assuming this rate goes up somewhat (let's say it doubles, though preliminary data suggest even that is too high) and that 3 SMEs each take 10 minutes to review every AI-generated item (again, likely too high), an average of about 50 minutes of SME time would be invested for every accepted item. My sense is that an operational testing program with experienced item reviewers and refinements in the quality of AI prompts will reduce this time considerably.
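One simple way to arrive at a figure in that ballpark, assuming a rejected item is screened out by the first reviewer while an accepted item is reviewed by the full panel of three, is the back-of-the-envelope sketch below. The exact review workflow will vary by program, so treat this as an illustration rather than a costing model.

```python
# Back-of-the-envelope estimate of SME minutes invested per accepted
# AI-generated item. Illustrative assumptions only:
#   - a rejected item is caught by the first reviewer (one review's time)
#   - an accepted item is reviewed by the full SME panel

def sme_minutes_per_accepted_item(rejection_rate: float,
                                  minutes_per_review: float = 10.0,
                                  panel_size: int = 3) -> float:
    """Expected SME minutes spent for each item that is ultimately accepted."""
    acceptance_rate = 1.0 - rejection_rate
    # On average, rejection_rate / acceptance_rate rejected items precede
    # each accepted item (geometric distribution).
    rejects_per_accept = rejection_rate / acceptance_rate
    reject_cost = rejects_per_accept * minutes_per_review  # one reviewer per reject
    accept_cost = panel_size * minutes_per_review          # full panel per accept
    return reject_cost + accept_cost

# Doubling the historical 1/3 rejection rate gives 2/3:
print(round(sme_minutes_per_accepted_item(rejection_rate=2 / 3), 1))  # -> 50.0
```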

3. Test security concerns (almost) disappear

When content is readily available, multiple examination forms can easily be created. As a result, test security issues can either be proactively avoided by making multiple versions of exams available, or quickly managed after a security incident, at least to the extent that exposed items can easily be replaced. Of course, having a large bank of calibrated items is the holy grail, and we're not quite there yet[2], but there are ways to limit the number of calibrated items required (see #4).

4. New, cheaper, and better testing models will emerge

Infinite content should cause everyone to rethink assumptions about their testing models. One model that Spire is working with is to use only new content for each and every form, except for anchor item requirements (around 25%). These new items would be selected for scoring after administration so that the scored set is content-balanced overall while reliability is maximized. This model would require an exam with enough items to be able to reject 15%–20% of them, an assumption met by most current exams with embedded experimental items. This model might even improve test quality and processes, since only the best items would be selected and key validation would likely prove unnecessary.
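To make the post-administration selection step concrete, here is a minimal sketch (not Spire's actual procedure) of one way it could work: within each content domain, keep the items with the strongest statistics until the domain's target count is met, so the weakest 15%–20% of items are effectively rejected. The Item class, the use of point-biserial discrimination as the quality criterion, and the domain targets are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Item:
    item_id: str
    domain: str            # content area the item is classified under
    discrimination: float  # e.g., point-biserial from the live administration

def select_scored_items(items: list[Item], targets: dict[str, int]) -> list[Item]:
    """Choose scored items after administration: content-balanced overall,
    favoring the items whose statistics contribute most to reliability.

    `targets` maps each content domain to the number of scored items it
    should contribute, which enforces the overall content balance.
    """
    selected: list[Item] = []
    for domain, n_needed in targets.items():
        pool = [item for item in items if item.domain == domain]
        # Keep the most discriminating items; the remainder are rejected
        # (roughly the weakest 15%-20% under the assumption above).
        pool.sort(key=lambda item: item.discrimination, reverse=True)
        selected.extend(pool[:n_needed])
    return selected

# Illustrative use with made-up statistics:
bank = [
    Item("A1", "domain_a", 0.42), Item("A2", "domain_a", 0.18),
    Item("A3", "domain_a", 0.35), Item("B1", "domain_b", 0.29),
    Item("B2", "domain_b", 0.11), Item("B3", "domain_b", 0.33),
]
scored = select_scored_items(bank, {"domain_a": 2, "domain_b": 2})
print([item.item_id for item in scored])  # -> ['A1', 'A3', 'B3', 'B1']
```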

 

Each one of these four developments is individually enough to disrupt business as usual in the testing industry, but they will all happen at once. And I could go on, as I will in later blog posts. In the meantime, suffice it to say that we live in exciting times!

[1] Including metadata, but NOT rationales for correct and incorrect options, which AI readily produces.

[2] Although this may be the golden age of Assessment Engineering, which could decrease reliance on after-the-fact calibration.
