With the growing interest in generative AI (GenAI) for language assessment, its potential as a rater has attracted attention. This study compares trained human raters’ scores with GenAI ratings of L2 pragmatic speaking performance across different task types. Fifty L2 English learners of varying proficiency levels completed pragmatic speaking test items, which were scored by five trained raters and ChatGPT-5. To examine comparability, many-facet Rasch measurement was employed, focusing on examinee ability, rater severity, item difficulty, and the functioning of the rating criteria. Findings indicated a moderate correlation between GenAI and human ratings of examinee ability. Compared to human raters, ChatGPT showed higher internal consistency but produced a narrower distribution of examinee ability. ChatGPT ratings tended to focus on explicit features, such as the specific conditions of real-life pragmatic tasks and formulaic expressions, while scoring off-task performances and implicit sociopragmatic dimensions inconsistently. These findings are discussed in light of the potential of GenAI for low-stakes classroom assessment.
This paper examines problematic areas in assessing children’s language learning and suggests key solutions to the problems arising from different types of assessment. A critical evaluation of a variety of alternative assessment methods yielded several teaching implications. First, assessment should be conducted through informal tests in which learners do not notice that they are being assessed. Although assessing young learners should be compatible with familiar forms of learning, such as the activities used every day in their classroom, instructions for classroom assessment activities need to be handled with care. Assessing young learners through group or pair work can be more effective than traditional tests in enhancing social and communication skills. However, equity with respect to learners’ participation in the activities, their English knowledge, and their learning experience needs to be taken into serious consideration. Finally, more attempts should be made to promote teacher-student interaction through student journals and conferencing assessment, even though this may not be a culturally preferred learning style in Korea. This paper thus offers solutions for assessing young learners effectively from multiple perspectives rather than depending on a single assessment instrument.