With the growing interest in generative AI (GenAI) for language assessment, its potential as a rater has attracted increasing attention. This study compares trained human raters’ scores with GenAI ratings in assessing L2 pragmatic speaking performance across different task types. Fifty L2 English learners of varying proficiency levels completed pragmatic speaking test items, which were scored by five trained raters and ChatGPT-5. To examine comparability, many-facet Rasch measurement was employed, focusing on examinee ability, rater severity, item difficulty, and the functioning of the rating criteria. Findings indicated a moderate correlation between GenAI and human ratings of examinee ability. Compared to human raters, ChatGPT exhibited higher internal consistency and produced a narrower distribution of examinee ability estimates. ChatGPT ratings tended to focus on explicit features, such as specific conditions in real-life pragmatic tasks and formulaic expressions, while showing inconsistency in scoring off-task performances and implicit sociopragmatic dimensions. These findings are discussed in light of the potential of GenAI for low-stakes classroom assessment.
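For reference, the facets named above map onto the terms of the standard many-facet Rasch model; the following is a sketch of the usual formulation, with notation chosen for illustration rather than taken from the study itself:

\[
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - C_j - D_i - F_k
\]

where \(B_n\) is the ability of examinee \(n\), \(C_j\) the severity of rater \(j\) (human or ChatGPT), \(D_i\) the difficulty of item \(i\), and \(F_k\) the threshold governing the step from score category \(k-1\) to \(k\) on the rating scale.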