Due to the vast amount of easily accessible online text on a wide variety of topics, plagiarism of written content has become a vexing problem for educators. To address this problem, several services are available for the automatic detection of plagiarism in written texts. Furthermore, a series of shared tasks has enabled a variety of approaches to plagiarism detection to be compared on a standardized set of text documents. Beyond the domain of writing evaluation, plagiarism also poses problems for the evaluation of spoken language, in particular the assessment of non-native speaking proficiency. In the context of large-scale, standardized assessments of spoken English for academic purposes, such as the TOEFL iBT test, the Pearson Test of English Academic, and the IELTS Academic assessment, test takers may draw on content from online resources in their spoken responses to test questions that are intended to elicit spontaneous speech. Such responses based on canned material pose a problem for both human raters and automated scoring systems, and can reduce the validity of the scores provided to test takers.

In this paper, we investigate a variety of features for automatically detecting plagiarized spoken responses in the context of a standardized assessment of English speaking proficiency. In addition to examining several commonly used text-to-text content similarity features that have been shown to be useful for detecting plagiarized written texts, we also propose novel features that compare various aspects of speaking proficiency across multiple responses provided by a test taker. These features are based on the hypothesis that certain aspects of speaking proficiency, such as fluency, may be artificially inflated in a test taker's canned responses in comparison to non-canned responses, and they are designed to be independent of the availability of the reference source materials.
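As a minimal illustration of a text-to-text content similarity feature, the sketch below computes cosine similarity over bag-of-words term-frequency vectors between a candidate response transcript and a known source passage. This is only one representative measure of the kind described above, not the exact feature set used in the study, and the example texts are invented for illustration.

```python
from collections import Counter
import math

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity over bag-of-words term-frequency vectors."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# A response that closely matches a known source scores near 1.0;
# an unrelated response scores near 0.0.
source = "a healthy diet and regular exercise are important for success"
response = "i think a healthy diet and regular exercise are important for success"
unrelated = "my favorite city is one with many museums and parks"
print(cosine_similarity(source, response))
print(cosine_similarity(source, unrelated))
```

In a source-based detection setting, a response whose maximum similarity against the pool of known source materials exceeds a threshold would be flagged for review.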
Finally, we evaluate the effectiveness of the system on a data set containing a large number of control responses, in order to simulate the imbalanced distribution of an operational setting, in which only a small fraction of test takers' responses are plagiarized.
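Under such an imbalanced distribution, overall accuracy is uninformative (a detector that flags nothing is almost always "correct"), so precision, recall, and F1 on the plagiarized class are the natural evaluation measures. The sketch below, with invented counts, shows how these are computed for a simulated set of 3 plagiarized responses among 100.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive (plagiarized) class."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Simulated imbalanced setting: 3 plagiarized responses among 100 total.
y_true = [1] * 3 + [0] * 97
# A hypothetical detector flags 4 responses, 2 of them correctly.
y_pred = [1, 1, 0] + [1] * 2 + [0] * 95
p, r, f = precision_recall_f1(y_true, y_pred)
```

Here the detector's accuracy would be 97%, yet it misses a third of the plagiarized responses and half of its flags are false alarms, which the class-specific metrics make visible.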
The performance of the features in Section 4.1 depends heavily on the availability of a comprehensive set of source materials: if a test taker bases a plagiarized response on unseen source materials, the system may fail to detect it. Therefore, this study also examines a novel set of features that do not rely on a comparison with source materials. As described in Section 3, the Speaking section of the TOEFL iBT assessment includes both independent and integrated tasks. In a given test administration, test takers are required to respond to all six test questions, and plagiarized responses are more likely to appear in the two independent tasks, since these are not based on specific reading and/or listening passages and thus elicit a wider range of variation across responses. Since plagiarized responses are mostly constructed from memorized material, they may be delivered in a more fluent and proficient manner than responses containing fully spontaneous speech. Based on this assumption, we propose a novel set of features to capture the difference between various acoustic cues extracted from the canned and the spontaneous speech produced by the same test taker; this methodology is specifically designed to detect plagiarized responses to the independent tasks.

These features were developed based on an automated spoken English assessment system, SpeechRaterSM [29, 30]. SpeechRater automatically predicts the holistic speaking proficiency score of a spoken response and generates a set of approximately 100 features assessing different aspects of spontaneous speech. In this study, the automated proficiency scores, along with 29 SpeechRater features measuring fluency, pronunciation, prosody, rhythm, vocabulary, and grammar, were used. Since most plagiarized responses are expected to occur in the independent tasks, we assume the integrated responses are based on spontaneous speech.
A mismatch between the proficiency scores and feature values of the independent responses and those of the integrated responses from the same speaker can potentially indicate the presence of both prepared and spontaneous speech, and, therefore, the presence of plagiarized spoken responses. Given an independent response from a test taker, along with the other independent response and the four integrated responses from the same test taker, six features were extracted for each of the 30 measures (the automated proficiency score and the 29 SpeechRater features). First, the difference in the score/feature value between the two independent responses was calculated as a feature; this addresses the case in which only one independent response was canned while the other contained spontaneous speech. Then, basic descriptive statistics, namely the mean, median, minimum, and maximum, were computed across the four integrated responses, and the differences between the score/feature value of the independent response and each of these four statistics were extracted as additional features. Finally, one more feature was obtained by standardizing the score/feature value of the independent response with the mean and standard deviation of the integrated responses. In total, a set of 180 features (30 measures × 6 features) was extracted, referred to as SRater in the following experiments.
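The six-feature construction per measure can be sketched as follows. The feature names and the example fluency-like values are illustrative, not taken from the study; the function follows the steps described above for one measure and would be applied to all 30 measures to yield the 180 SRater features.

```python
import statistics

def srater_features(indep_value, other_indep_value, integrated_values):
    """Six features for one measure (proficiency score or SpeechRater
    feature), given the target independent response's value, the other
    independent response's value, and the values for the speaker's four
    integrated responses. Feature names are illustrative."""
    feats = {
        # 1: difference between the two independent responses
        "diff_indep": indep_value - other_indep_value,
    }
    # 2-5: differences from basic statistics over the integrated responses
    for name, stat in [
        ("mean", statistics.mean(integrated_values)),
        ("median", statistics.median(integrated_values)),
        ("min", min(integrated_values)),
        ("max", max(integrated_values)),
    ]:
        feats[f"diff_{name}"] = indep_value - stat
    # 6: independent response standardized against the integrated responses
    sd = statistics.stdev(integrated_values)
    mean = statistics.mean(integrated_values)
    feats["zscore"] = (indep_value - mean) / sd if sd else 0.0
    return feats

# Example: one independent response (4.0) is markedly more fluent than the
# other independent response (2.8) and the four integrated responses.
feats = srater_features(4.0, 2.8, [2.9, 3.0, 2.7, 3.1])
```

In this invented example the large positive differences and z-score reflect exactly the mismatch the features are meant to capture: prepared speech delivered more proficiently than the same speaker's spontaneous speech.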