Automated Scoring of Short Answer Questions with Large Language Models: Impacts of Model, Item, and Rubric Design
Abstract
Short answer questions (SAQs) are useful in educational testing, but they can be resource-intensive to grade at scale. Large language models (LLMs) have improved markedly in recent years, and with many high-quality models commercially available, there is an opportunity to leverage LLMs to aid SAQ scoring. This study explored how well several off-the-shelf LLMs automatically score responses to High School Math and English SAQs. LLMs rated responses using three rubrics of varying detail, and their scores were compared to human scores. Results showed that LLM performance improved with more detailed rubrics. On many items across both subjects, several models achieved near-perfect prediction, while other items showed lower performance. This paper demonstrates that many LLMs can accurately score SAQs and that performance is influenced by item features and rubric design.
Topics: Large Language Models, Automated Scoring, Educational Assessment
- Frohn, S., Burleigh, T., & Chen, J. (2025). Automated Scoring of Short Answer Questions with Large Language Models: Impacts of Model, Item, and Rubric Design. In Artificial Intelligence in Education (pp. 44–51). Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-98465-5_6
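
The paper does not reproduce its scoring pipeline here, but the abstract's general pattern (prompt an off-the-shelf LLM with an item, a rubric, and a student response, parse the returned score, and compare against human ratings) can be sketched as below. The model name, prompt wording, OpenAI client usage, and the choice of quadratic weighted kappa as the agreement statistic are all illustrative assumptions, not the authors' published method.

```python
# Illustrative sketch only: model choice, prompt wording, and agreement metric
# are assumptions for demonstration, not the study's actual pipeline.
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def score_response(item: str, rubric: str, response: str, model: str = "gpt-4o") -> int:
    """Ask an off-the-shelf LLM to score one SAQ response against a rubric."""
    prompt = (
        "You are scoring a short answer question.\n"
        f"Item: {item}\n"
        f"Rubric: {rubric}\n"
        f"Student response: {response}\n"
        "Return only the integer score."
    )
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(completion.choices[0].message.content.strip())


def agreement(llm_scores: list[int], human_scores: list[int]) -> float:
    """Quadratic weighted kappa, a common agreement statistic for ordinal rubric scores."""
    return cohen_kappa_score(llm_scores, human_scores, weights="quadratic")
```

In practice, more detailed rubrics would simply be swapped into the `rubric` argument, which matches the abstract's finding that richer rubric text tends to improve LLM-human agreement.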