ORCID: https://orcid.org/0009-0001-1090-9299; Bühler, Babette and Kasneci, Enkelejda
(2025):
Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring.
In: LAK '25: Proceedings of the 15th International Learning Analytics and Knowledge Conference, Dublin, Ireland, March 3-7, 2025.
Association for Computing Machinery (Ed.),
Association for Computing Machinery: New York, NY, United States. pp. 462-472

Abstract
The manual assessment and grading of student writing is a time-consuming yet critical task for teachers. Recent developments in generative AI offer potential solutions to facilitate essay-scoring tasks for teachers. In our study, we evaluate the performance (e.g., alignment and reliability) of both open-source and closed-source LLMs in assessing German student essays, comparing their evaluations to those of 37 teachers across 10 pre-defined criteria (e.g., plot logic, expression). A corpus of 20 real-world essays from Year 7 and 8 students was analyzed using five LLMs: GPT-3.5, GPT-4, o1-preview, LLaMA 3-70B, and Mixtral 8x7B, aiming to provide in-depth insights into LLMs’ scoring capabilities. Closed-source GPT models outperform open-source models in both internal consistency and alignment with human ratings, particularly excelling in language-related criteria. The o1 model outperforms all other LLMs, achieving Spearman’s ρ = .74 with human assessments on the Overall score and an internal consistency of .80, though it is biased towards higher scores. These findings indicate that LLM-based assessment can be a useful tool to reduce teacher workload by supporting the evaluation of essays, especially with regard to language-related criteria. However, due to their tendency to overrate and their remaining difficulties in capturing content quality, the models require further refinement.
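A minimal illustrative sketch (not the authors' code) of the agreement analysis described above: alignment is measured as the Spearman rank correlation between LLM and mean teacher scores, and internal consistency is approximated here with Cronbach's alpha over repeated LLM runs. The score arrays, the number of runs, and the choice of Cronbach's alpha are assumptions for illustration; the paper's own reliability measure may differ.

```python
# Illustrative sketch only: placeholder data standing in for essay scores.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_essays, n_runs = 20, 5                      # 20 essays, 5 repeated LLM runs (assumed)

teacher_mean = rng.uniform(1, 6, n_essays)    # mean teacher score per essay (placeholder)
llm_runs = teacher_mean + rng.normal(0, 0.7, (n_runs, n_essays))  # simulated LLM scores per run

# Alignment with human ratings: Spearman correlation between the mean
# LLM score and the mean teacher score across essays.
rho, p = spearmanr(llm_runs.mean(axis=0), teacher_mean)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")

# Internal consistency across repeated runs: Cronbach's alpha, treating
# each run as one "rater" of the same set of essays.
k = n_runs
run_var_sum = llm_runs.var(axis=1, ddof=1).sum()   # sum of per-run variances
total_var = llm_runs.sum(axis=0).var(ddof=1)       # variance of per-essay score sums
alpha = k / (k - 1) * (1 - run_var_sum / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")
```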
Document type: | Conference contribution (Paper) |
---|---|
Keywords: | Large Language Models; Automated Essay Scoring; Learning Analytics; Education |
Faculty: | Languages and Literatures > Department 1 > German Studies > Subject Didactics |
Subject areas: | 400 Language > 410 Linguistics |
URN: | urn:nbn:de:bvb:19-epub-126938-0 |
ISBN: | 979-8-4007-0701-8 |
Place of publication: | Association for Computing Machinery |
Language: | English |
Document ID: | 126938 |
Date of publication on Open Access LMU: | 07 Aug 2025 10:29 |
Last modified: | 07 Aug 2025 10:29 |