Previous research related to automated plagiarism detection of natural language has mainly focused on written documents. A wide variety of techniques have been employed in this task, including n-gram overlap [5], document fingerprinting [6], word frequency statistics [7], Information Retrieval-based metrics [8], text summarization evaluation metrics [9], WordNet-based features [10], stopword-based features [11], features based on shared syntactic patterns [12], features based on word swaps detected via dependency parsing [13], and stylometric features [14], among others. Generally, in the monolingual plagiarism detection task, these methods can be grouped into external plagiarism detection, where a document is compared against a body of reference documents, and intrinsic plagiarism detection, where a document is evaluated independently without a reference collection [15].
This task is also related to the widely studied task of paraphrase recognition, which benefits from similar types of features [16, 17]. The current study adopts several of these features that are designed to be robust to the presence of word-level modifications between the source and the plagiarized text; since this study focuses on spoken responses that are reproduced from memory and subsequently processed by a speech recognizer, metrics that rely on exact matches are likely to perform sub-optimally.

Little prior work has been conducted on the task of automatically detecting similar spoken responses, although research in the field of Spoken Document Retrieval [18] is relevant. In one similar previous study, [19] investigated several text-to-text content similarity metrics for detecting plagiarized spoken responses in the context of a large-scale standardized assessment of English proficiency for academic purposes. That study reported an overall accuracy of 87.1% when 9 document-level similarity features were used in a decision-tree classifier. However, the distribution of plagiarized and non-plagiarized responses in the test set of that study was balanced, so the result may not accurately reflect performance in an operational deployment, where the distribution is expected to be heavily skewed. To address this, the current study employs a large control set of non-plagiarized responses.

Due to the difficulties involved in collecting corpora of actual plagiarized material, nearly all published results of approaches to the task of plagiarism detection have relied on either simulated plagiarism (i.e., plagiarized texts generated by experimental human participants in a controlled environment) or artificial plagiarism (i.e., plagiarized texts generated by algorithmically modifying a source text) [20]. These results, however, may not reflect actual performance in a deployed setting, since the characteristics of such material may differ from actual plagiarized responses. Similar to [19], the current study is conducted on a corpus of actual plagiarized responses drawn from a large-scale assessment; however, the number of plagiarized responses in the current study is substantially larger, more than six times the number of actual plagiarized responses collected previously.
The data used in this study was drawn from the TOEFL® Internet-based test (TOEFL® iBT), a large-scale, high-stakes assessment of English for non-native speakers, which assesses English communication skills for academic purposes. The Speaking section of the TOEFL iBT test contains six tasks that elicit spontaneous spoken responses: two of them require test takers to provide an opinion based on personal experience, and are referred to as independent tasks; the other four require them to summarize or discuss material provided in a reading and/or listening passage, and are referred to as integrated tasks [2]. In general, the independent tasks ask questions that are familiar to test takers and are not based on any stimulus materials; test takers can therefore provide responses containing a wide variety of specific examples. In some cases, test takers may attempt to game the assessment by memorizing canned material from an external source and adapting it to a question asked in the independent tasks. This type of plagiarism can affect the validity of a test taker’s speaking score and can be grounds for score cancellation.
However, it is often difficult even for trained human raters to recognize plagiarized spoken responses, due to the large number and variety of external sources available from online test preparation sites. In order to better understand the strategies used by test takers who incorporated material from external sources into their spoken responses, and to develop a capability for automated plagiarism detection for speaking items, a data set of plagiarized spoken responses from operational tests was collected. Human raters first flagged operational spoken responses that contained potentially plagiarized material; rater supervisors subsequently reviewed them and made the final judgment. In the review process, the responses were transcribed and compared to external source materials obtained through manual internet searches; if it was determined that the presence of plagiarized material made it impossible to provide a valid assessment of the test taker’s performance on the task, the response was labeled as a plagiarized response and assigned a score of 0. In this study, 1,557 plagiarized responses to independent test questions were collected from the operational TOEFL iBT assessment across multiple years.

During the process of reviewing potentially plagiarized responses, the raters also collected a data set of external sources that appeared to have been used by test takers in their responses. In some cases, the test taker’s spoken response was nearly identical to an identified source; in other cases, several sentences or phrases were clearly drawn from a particular source, although some modifications were apparent. Table 1 presents a sample source that was identified for several of the responses in the data set. Many of the plagiarized responses contained extended sequences of words that directly match idiosyncratic features of this source, such as the phrases “how romantic it can ever be” and “just relax yourself on the beach.” In total, human raters identified 211 different source materials while reviewing the potentially plagiarized responses, and 162 of these passages were identified as sources of the plagiarized responses included in this study. However, all 211 identified passages are used as sources in the experiments in order to make the experimental design more similar to an operational setting, in which the exact set of source texts represented in a given set of plagiarized responses is not known. Summary statistics for the 211 source passages are as follows: the average number of words is 97.1 (std. dev. = 38.8), the average number of clauses is 10.9 (std. dev. = 5.5), and the average number of words per clause is 10.6 (std. dev. = 6.2).

In addition to the source materials and the plagiarized responses, a set of non-plagiarized control responses was also obtained in order to conduct classification experiments between plagiarized and non-plagiarized responses. Since the plagiarized responses were collected over the course of multiple years, they were drawn from many different TOEFL iBT test forms, and it was not practical to obtain control data from all of the test forms represented in the plagiarized set. Therefore, only the 166 test forms that appear most frequently in the plagiarized data set were used for the collection of control responses, and 200 test takers were randomly selected from each form, without any overlap with speakers in the plagiarized set.
The two spoken responses to the two independent questions in each test form were collected from each speaker; in total, 66,400 spoken responses from 33,200 speakers were obtained as the control set. The data set used in this study is therefore highly imbalanced: the control set is almost 43 times the size of the plagiarized set.

4. METHODOLOGY

This study first developed several features to measure the content similarity between a test spoken response and the collected source materials. In addition, a novel set of features was employed to address this particular task of plagiarism detection for spontaneous spoken responses. Since the production of spoken language based on memorized material is expected to differ from the production of non-plagiarized speech in aspects of a test taker’s delivery, such as fluency, pronunciation, and prosody, we also evaluate the contribution of a range of features based on acoustic cues from spontaneous speech.

4.1. Content Similarity

Based on previous work [19] showing the effectiveness of content-based features for the task of automatically detecting plagiarized spoken responses, this work also employs several features based on text-to-text similarity. Given a test response, a comparison is made with each of the 211 reference sources using the following content similarity metrics: 1) BLEU [22]; 2) Latent Semantic Analysis (LSA) [23]; 3) a WordNet similarity metric based on presence in the same synset; 4) a WordNet similarity metric based on the shortest path between two words in the is-a taxonomy; 5) a WordNet similarity metric similar to (4) that also takes into account the maximum depth of the taxonomy in which the words occur [24]; 6) a WordNet similarity metric based on the depth of the Least Common Subsumer of the two words [25]; and 7) a WordNet similarity metric based on Lin’s Thesaurus [26].

For the BLEU- and LSA-based metrics, 211 document-level similarities are generated by comparing a test response against each of the 211 source materials; the maximum similarity is then taken as a single feature measuring the content overlap between the test response and the sources. For the WordNet-based metrics, each word in the test response is first compared with each word in the source, and the maximum similarity value is obtained for each word in the test response; the maximum similarity scores for all the words in the test response are then averaged to generate a feature measuring document-level similarity.

Features based on the BLEU metric have proven effective in measuring the content appropriateness of spoken responses in the context of English proficiency assessment [27] and in measuring content similarity in the detection of plagiarized spoken responses [19]. This work builds on these previous results by investigating additional features based on the BLEU metric and its variations. First, the traditional BLEU metric combines the modified n-gram precisions to evaluate a machine-generated translation against multiple reference translations, where n generally ranges from unigram up to 4-gram [22]. Here, given a test response, the standard BLEU score is calculated for each of the 211 sources and the maximum value is taken as one similarity feature. In addition, a total of 11 different BLEU scores were generated as features by varying the maximum n-gram order from unigram to 11-gram.
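As a minimal sketch of how these document-level features might be computed, the following Python fragment derives the maximum-over-sources BLEU features for varying maximum n-gram orders and one of the WordNet-based features, assuming tokenized responses and sources and NLTK's BLEU and WordNet interfaces; the function names, the smoothing choice, and the aggregation of the WordNet values across sources are illustrative assumptions rather than the exact implementation used in this study.

from nltk.corpus import wordnet as wn
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_features(response_tokens, sources_tokens, max_orders=range(1, 12)):
    # One feature per maximum n-gram order: the maximum BLEU score of the
    # response against all source passages (211 in this study).
    smooth = SmoothingFunction().method1  # smoothing is an assumption, not stated above
    features = {}
    for n in max_orders:
        weights = tuple(1.0 / n for _ in range(n))  # uniform weights over orders 1..n
        features["bleu_max_order_%d" % n] = max(
            sentence_bleu([src], response_tokens, weights=weights,
                          smoothing_function=smooth)
            for src in sources_tokens)
    return features

def wordnet_path_feature(response_tokens, source_tokens):
    # For each response word, take the maximum WordNet path similarity to any
    # source word; then average these maxima over the response (metric 4 above).
    per_word_max = []
    for w in response_tokens:
        best = 0.0
        for s in source_tokens:
            for syn_w in wn.synsets(w):
                for syn_s in wn.synsets(s):
                    sim = syn_w.path_similarity(syn_s)
                    if sim is not None and sim > best:
                        best = sim
        per_word_max.append(best)
    return sum(per_word_max) / len(per_word_max) if per_word_max else 0.0

In this sketch, the per-source WordNet values are assumed to be aggregated across the 211 sources in the same way as BLEU and LSA, i.e., by taking the maximum.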
The intuition behind decreasing the maximum order is to increase the classifier’s recall by evaluating the overlap of shorter word sequences, such as individual words in the unigram setting. On the other hand, the motivation behind increasing the maximum order is to boost the classifier’s precision, since it will focus on matches of longer word sequences. Here, the maximum order of 11 was selected based on the average number of words per clause in the source materials, which is approximately 11, as described in Section 3.

In order to verify the effectiveness of the BLEU features extracted by varying the maximum n-gram order, a preliminary experiment was conducted. 10-fold cross-validation was performed on the whole data set using the decision tree classifier from SKLL [28]. The 11 BLEU features were extracted based on the output of an automatic speech recognition (ASR) system with a word error rate of around 28%. Precision, recall, and F1-measure on the positive class, i.e., the plagiarized responses, are used as the evaluation metrics. Further details about the experimental set-up can be found in Section 5.1. As shown in Table 2, compared with the standard BLEU score, recall can be improved from 0.393 to 0.447 by varying the maximum n-gram order from 1 to 4. Further extending the maximum order to 11 boosts precision from 0.429 to 0.438. The combination of the 11 BLEU features improves the F1-score from 0.425 (with the maximum order of 4) to 0.44
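For concreteness, this preliminary experiment can be approximated as follows, assuming the 11 BLEU features have already been assembled into a feature matrix X (one row per response) with binary labels y (1 = plagiarized). SKLL wraps scikit-learn, so a plain scikit-learn decision tree with stratified 10-fold cross-validation is used here as a stand-in for the SKLL configuration; the variable names and the stratification choice are assumptions.

from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

def evaluate_bleu_features(X, y):
    # Decision tree over the 11 BLEU features, scored with precision, recall,
    # and F1 on the positive (plagiarized) class, averaged across 10 folds.
    clf = DecisionTreeClassifier(random_state=0)
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_validate(clf, X, y, cv=folds,
                            scoring=("precision", "recall", "f1"))
    return {metric: scores["test_%s" % metric].mean()
            for metric in ("precision", "recall", "f1")}

Reporting precision, recall, and F1 on the plagiarized class rather than accuracy reflects the roughly 43:1 class imbalance in the data set described above.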