Examining rater reliability and bias in measuring English speaking performance: A comparison of scores on an oral interview, a computerized oral test, and the Versant test
The purpose of this study was to investigate inter- and intra-rater reliability on an oral interview and a computerized oral test. The study also examined whether rater characteristics influenced rater reliability and bias, and finally the scores of both tests were compared with those of the Versant test, which uses an automated computer scoring system. Data were collected from 21 Korean university students and 18 raters, either Korean or native speakers of English, with various characteristics. The main findings were as follows. First, rater severity differed significantly across raters on each test, but each rater graded consistently across the two tests, suggesting low inter-rater reliability but high intra-rater reliability. Second, rater severity was influenced by rater characteristics such as mother tongue, gender, age, and major. Finally, there was a positive correlation among the scores of the three tests, indicating that human and computer ratings are strongly related.