Seßler, Kathrin; Fürstenberg, Maurice (ORCID: https://orcid.org/0009-0001-1090-9299); Bühler, Babette and Kasneci, Enkelejda (2025): Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring. In: LAK '25: Proceedings of the 15th International Learning Analytics and Knowledge Conference, Dublin, Ireland, 3-7 March 2025. Association for Computing Machinery: New York, NY, United States. pp. 462-472

Creative Commons: Attribution 4.0 (CC-BY)
Published Version

Abstract

The manual assessment and grading of student writing is a time-consuming yet critical task for teachers. Recent developments in generative AI offer potential solutions to facilitate essay-scoring tasks. In our study, we evaluate the performance (e.g., alignment and reliability) of both open-source and closed-source LLMs in assessing German student essays, comparing their evaluations to those of 37 teachers across 10 pre-defined criteria (e.g., plot logic, expression). A corpus of 20 real-world essays from Year 7 and 8 students was analyzed using five LLMs: GPT-3.5, GPT-4, o1-preview, LLaMA 3-70B, and Mixtral 8x7B, aiming to provide in-depth insights into LLMs' scoring capabilities. Closed-source GPT models outperform open-source models in both internal consistency and alignment with human ratings, particularly excelling in language-related criteria. The o1 model outperforms all other LLMs, achieving Spearman's ρ = .74 with human assessments in the Overall score and an internal consistency of .80, though it is biased towards higher scores. These findings indicate that LLM-based assessment can be a useful tool for reducing teacher workload by supporting the evaluation of essays, especially with regard to language-related criteria. However, due to their tendency to overrate essays and their remaining difficulties in capturing content quality, the models require further refinement.
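
The alignment metric named in the abstract, Spearman's rank correlation, can be illustrated with a short sketch. This is not the authors' code: the variable names and all score values below are invented for demonstration, assuming the 20 essays are rated on a shared scale by both the model and the teachers (Python with SciPy):

    # Illustrative sketch only (not the authors' code): alignment between
    # LLM scores and mean teacher ratings as Spearman's rank correlation.
    # All score values are invented; assume 20 essays on a 1-6 scale.
    from scipy.stats import spearmanr

    teacher_mean_scores = [4.2, 3.1, 5.0, 2.8, 3.9, 4.5, 2.5, 3.3, 4.8, 3.6,
                           2.9, 4.1, 3.7, 5.2, 2.6, 4.4, 3.0, 4.9, 3.5, 4.0]
    # The made-up LLM scores run consistently higher, mirroring the
    # overrating bias reported in the abstract.
    llm_scores = [4.5, 3.5, 5.5, 3.0, 4.5, 5.0, 3.0, 3.5, 5.0, 4.0,
                  3.5, 4.5, 4.0, 5.5, 3.0, 5.0, 3.5, 5.0, 4.0, 4.5]

    rho, p_value = spearmanr(teacher_mean_scores, llm_scores)
    print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")

Because Spearman's ρ is rank-based, a value near 1 would mean the model orders essays almost exactly as the teachers do, even when (as reported for o1) its absolute scores run systematically higher.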
