The present study investigated students’ preferences for the types of tasks used to assess English speaking performance. It further examined whether students’ task type preferences affected their perceptions of test effectiveness. One hundred eighty-two high school students responded to a self-report questionnaire. A series of frequency analyses and paired-samples t-tests were used for the analysis. The results showed that students’ most preferred task types and their least preferred ones overlapped, suggesting that the task types used in school English-speaking performance tests are limited. Four key reasons determining students’ task type preferences were identified: task difficulty, emotional comfort, practical value, and interest. In addition, the results indicated that students’ task type preferences could affect their perceptions of task effectiveness. Overall, the results suggest the need to develop more varied task types for English-speaking performance tests and to help students become familiar with English-speaking performance tasks. Pedagogical implications are discussed along with study limitations.
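A minimal sketch of the kind of paired-samples t-test reported in this abstract is given below; all data, sample values, and variable names are hypothetical and serve only to illustrate the procedure, not to reproduce the study's results.

# Paired-samples t-test sketch: comparing perceived test effectiveness for a
# student's preferred vs. non-preferred task type. Data are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical 5-point Likert ratings from the same 182 students,
# once for a preferred task type and once for a non-preferred one.
preferred = rng.integers(3, 6, size=182).astype(float)
non_preferred = rng.integers(1, 5, size=182).astype(float)

t_stat, p_value = stats.ttest_rel(preferred, non_preferred)
print(f"paired t({len(preferred) - 1}) = {t_stat:.2f}, p = {p_value:.3f}")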
The purpose of this study was to investigate inter- and intra-rater reliability in an interview test and a computerized oral test. It also examined whether rater characteristics influenced raters’ reliability and biases, and finally the scores of both tests were compared with those of the Versant test, which uses an automated computer scoring system. Data were collected from 21 Korean university students and 18 raters (Korean speakers or native speakers of English) with various characteristics. The main findings were as follows. First, rater severity differed significantly across raters within each test, but each rater graded consistently across both tests, suggesting lower inter-rater reliability and higher intra-rater reliability. Second, rater severity was affected by rater characteristics such as mother tongue, gender, age, and major. Lastly, there was a positive correlation among the scores of the three tests, indicating that human and computer ratings are strongly related.
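A minimal sketch of the correlation check described in this abstract is given below: Pearson correlations among scores from the interview, the computerized oral test, and the Versant test. All scores and the simulated relationships among them are hypothetical assumptions for illustration only.

# Correlation sketch among three sets of speaking scores (hypothetical data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

interview = rng.normal(70, 10, size=21)
computerized = interview + rng.normal(0, 5, size=21)  # assumed to track the interview scores
versant = interview + rng.normal(0, 6, size=21)       # assumed to track the interview scores

for name, scores in [("computerized", computerized), ("Versant", versant)]:
    r, p = stats.pearsonr(interview, scores)
    print(f"interview vs {name}: r = {r:.2f}, p = {p:.3f}")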
The present study investigates the types of grammar errors that 84 Korean EFL (KEFL) learners made on part five of the TOEIC Speaking test and compares the results with previous studies. Eighty-four undergraduates participated, and two native speakers of English analyzed the data. The errors were classified using the taxonomy of four surface strategies: omission, addition, misformation, and misordering. Errors of omission were the most frequent, accounting for 74.9%, followed by misformation (19.9%), addition (3.5%), and misordering (1.7%). These frequencies differed markedly from those reported for the writing and interview tasks in the previous studies adopted for comparison, where errors of misformation were the most frequent, followed by errors of omission. This difference was attributed to the short test-taking time and to the nature of the task itself. Very few self-corrections were observed in the spoken answers, suggesting that the KEFL learners were unlikely to attempt to correct their errors within the limited time. The study suggests that explicit grammar instruction and correction are needed in teaching speaking.
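A minimal sketch of how such surface-strategy error frequencies could be tabulated is shown below; only the four category labels come from the study, and the counts are hypothetical values chosen for illustration.

# Frequency table for surface-strategy error categories (hypothetical counts).
from collections import Counter

errors = (["omission"] * 300 + ["misformation"] * 80 +
          ["addition"] * 14 + ["misordering"] * 7)

counts = Counter(errors)
total = sum(counts.values())
for category in ["omission", "addition", "misformation", "misordering"]:
    share = 100 * counts[category] / total
    print(f"{category:12s} {counts[category]:4d}  ({share:.1f}%)")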
This study investigates secondary school English teachers’ perceptions of, and the psychological burdens involved in, the implementation of the speaking and writing tests of the National English Ability Test, which is being developed by the Korean Ministry of Education, Science and Technology. The study surveyed 138 secondary school English teachers in Seoul. Although more than half of the teachers were aware of the new test, 18% were not aware that speaking and writing skills would be assessed in it, and 22.7% were opposed to the productive skills tests. More than half (56.2%) of the teachers felt a psychological burden regarding the inclusion of the speaking/writing tests. Although the teachers acknowledged that serving as raters for the new test would help improve their teaching, the majority were reluctant to participate in the actual rating process. The teachers regarded the difficulty of subjective rating and the lack of time for the speaking and writing tests as serious problems in implementing the new test. They were sensitive to students’ test anxiety and also indicated that they felt a strong psychological burden when making judgments on students’ performances. Implications and suggestions are made based on the findings.
Face-to-face interviews have long been used to elicit language samples in oral proficiency testing. Such direct testing is believed to measure more authentic and interactive language ability. However, it has been argued that speech samples from unstructured interviews differ from those produced in natural communication settings and that interviewees are forced into passive roles. Furthermore, inexperienced interviewers often allow superficial fluency to be rewarded, and the meaningfulness of communicative tasks (e.g., extended narrative and descriptive speech) tends to be underestimated. In unstructured interview situations where intuition-driven interviewers do not provide meaningful communicative tasks (or prompts), personality is implicitly foregrounded and interpersonal strategies are overvalued. When a test is intended to assess ‘language’ skills, such non-language skills make test usefulness dubious. This study explores issues of inappropriateness in face-to-face English language proficiency interviews in Korea. It also argues for the usefulness of semi-direct (tape- or computer-mediated) speaking tests in terms of task meaningfulness. Samples of superficial fluency collected in testing settings are commented upon, and perspectives from Korean contexts are discussed throughout the study.
Even though performance-based language tests are often multivariate by design, previous validation studies have analyzed the dependability of ratings of such tests within univariate analytic frameworks. To address this limitation, the present study investigated the dependability of judgments made on the basis of ratings of a German speaking test using a multivariate generalizability theory (MGT) analysis. Data obtained from 88 students were analyzed in G- and D-studies of a two-facet crossed design. A D-study with two raters and two tasks supported the high dependability of the ratings and of the placement decisions made at the pre-determined cut scores. An optimization analysis of the measurement procedure suggested that the desired level of dependability had already been achieved with two raters and two tasks. The MGT analysis also generated other useful information about the convergent/discriminant validity of the five subscales. Specifically, the universe score variance-covariance matrix obtained from the D-study showed that the underlying subscales were interrelated but distinct. Furthermore, the analysis of effective weights of the scales revealed that the Grammar subscale played the dominant role in the composite universe score variance.
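A simplified, univariate sketch of the D-study step described in this abstract is given below: computing the dependability (phi) coefficient for a person x rater x task crossed design from estimated variance components. The variance components are hypothetical, and this sketch does not reproduce the multivariate (MGT) analysis actually used in the study; it only illustrates how dependability changes as the numbers of raters and tasks vary.

# D-study dependability sketch for a p x r x t crossed design (hypothetical values).
def phi_coefficient(vc, n_raters, n_tasks):
    """Dependability coefficient for absolute decisions in a p x r x t design."""
    error = (vc["r"] / n_raters + vc["t"] / n_tasks +
             vc["pr"] / n_raters + vc["pt"] / n_tasks +
             vc["rt"] / (n_raters * n_tasks) +
             vc["prt,e"] / (n_raters * n_tasks))
    return vc["p"] / (vc["p"] + error)

# Hypothetical variance components from a G-study.
vc = {"p": 0.60, "r": 0.02, "t": 0.05, "pr": 0.04,
      "pt": 0.08, "rt": 0.01, "prt,e": 0.15}

for n_raters, n_tasks in [(1, 1), (2, 2), (3, 3)]:
    phi = phi_coefficient(vc, n_raters, n_tasks)
    print(f"raters={n_raters}, tasks={n_tasks}: phi = {phi:.2f}")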