Linguistic features that are indicative of higher writing proficiency levels can inform many aspects of language assessment, such as scoring rubrics, test items, and automated essay scoring (AES). Recent advances in computer algorithms that automatically calculate indices based on various linguistic features have made it possible to examine the relationship between linguistic features and writing proficiency on a larger scale. While the ability to use appropriate n-grams (recurring sequences of contiguous words) has been identified in the literature as a characteristic that differentiates proficiency levels, few studies have examined this relationship using computational indices. To this end, this study utilized the Tool for the Automatic Analysis of Lexical Sophistication (TAALES; Kyle & Crossley, 2015) to calculate eight n-gram-based indices from a stratified corpus of 360 argumentative essays written by Korean college-level learners. First, the indices from a training set of 240 essays were used to build a multinomial logistic regression model in order to identify indices that are significant predictors of writing proficiency levels. The regression model was then applied to a test set of 120 essays to examine whether it could predict the proficiency levels of unseen essays. The results revealed that the mean bigram T-score, mean bigram Delta P, mean bigram-to-unigram Delta P, and proportion of the 30,000 most frequent trigrams were significant predictors of proficiency level. Furthermore, the regression model based on the eight indices correctly classified 52.5% of the essays in the test set, demonstrating above-chance accuracy.
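A minimal sketch of the classification pipeline described above, assuming the TAALES output has been exported to CSV files; the file names and feature column names here are hypothetical placeholders (the actual TAALES index labels differ), and scikit-learn's multinomial logistic regression stands in for the reported model, with coefficient significance testing (e.g., via statsmodels) omitted for brevity.

```python
# Sketch: train a multinomial logistic regression on n-gram indices,
# then evaluate on a held-out test set. Column/file names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Assumed layout: one row per essay, eight TAALES n-gram indices
# plus a "level" column holding the proficiency label.
train = pd.read_csv("train_essays.csv")  # 240 essays
test = pd.read_csv("test_essays.csv")    # 120 essays

features = ["bigram_t", "bigram_delta_p", "bigram_to_unigram_delta_p",
            "trigram_prop_30k", "index_5", "index_6", "index_7", "index_8"]

# Fit the multinomial model on the training set.
model = LogisticRegression(multi_class="multinomial", max_iter=1000)
model.fit(train[features], train["level"])

# Apply the fitted model to unseen essays and report accuracy.
pred = model.predict(test[features])
print(f"Test-set accuracy: {accuracy_score(test['level'], pred):.1%}")
```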
The purpose of this paper is to investigate whether second language (L2) writing at different proficiency levels can be distinguished using automatic indices of linguistic complexity. For this study, 35 linguistic measures in 234 essays selected from the Yonsei English Learner Corpus were analyzed in order to identify the best indicators of L2 writing proficiency across three categories: text length, lexical complexity, and syntactic complexity. The key to this study is the use of computational tools, the L2 Syntactic Complexity Analyzer and the Lexical Complexity Analyzer, which measure different linguistic features of the target language, together with a robust statistical method, discriminant function analysis. Results showed that the automatic computational tools captured different uses of linguistic features across L2 writers’ proficiency levels. Specifically, more proficient writers produced longer texts, used more diverse vocabulary, and wrote more words per sentence and more complex nominalizations. These findings offer a window into the linguistic features that distinguish L2 writing proficiency levels and point to the possibility of using these computational tools for analyzing L2 learner corpus data.
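A comparable sketch of the discriminant function analysis, assuming the 35 measures from the two analyzers have been merged into a single table with a hypothetical "level" column; scikit-learn's LinearDiscriminantAnalysis is used here as a stand-in for the discriminant function analysis reported in the study.

```python
# Sketch: discriminant function analysis over 35 complexity measures.
# File and column names are hypothetical placeholders.
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

df = pd.read_csv("yelc_essays.csv")  # 234 essays, one row per essay
X = df.drop(columns=["level"])       # the 35 numeric linguistic measures
y = df["level"]                      # proficiency level label

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# Discriminant loadings: which measures contribute most to separating levels.
loadings = pd.DataFrame(lda.scalings_, index=X.columns)
print(loadings.abs().sort_values(by=0, ascending=False).head(10))

# Cross-validated estimate of how well the measures classify proficiency.
print(f"Mean CV accuracy: {cross_val_score(lda, X, y, cv=5).mean():.1%}")
```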