Recently, Computer-Aided Language Learning (CALL) has received considerable attention as a method for improving the English speaking skills of non-native students. For CALL systems to provide useful tutoring feedback, an automated scoring system is required to evaluate the pronunciation quality and fluency of non-native English speakers, as well as the specific mistakes they make. Most automatic spoken English fluency scoring systems have three components: Automatic Speech Recognition (ASR), fluency feature extraction, and a scoring model. ASR generates time-aligned word sequences for the input speech. Fluency feature extraction computes features that are assumed to be highly correlated with fluency in spoken English [1, 2, 3]. Various features are investigated in [1, 4]. Among them, long silence duration, silence duration, number of words per second, and phone duration are found to be common fluency features. Because each feature contributes differently to the final score, a small number of relatively significant features are selected using feature selection techniques [2, 4].
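To make the feature extraction step concrete, the following is a minimal sketch of how a few of the features named above could be computed from time-aligned ASR output. It is an illustration under stated assumptions, not the feature definitions used in [1, 4]: the `AlignedWord` type, the function name `fluency_features`, and the 0.5-second threshold for a "long" silence are all hypothetical choices, and phone duration is omitted because it would require phone-level alignments.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AlignedWord:
    """One word of time-aligned ASR output (hypothetical type for this sketch)."""
    word: str
    start: float  # start time in seconds
    end: float    # end time in seconds


def fluency_features(words: List[AlignedWord],
                     long_silence_threshold: float = 0.5) -> Dict[str, float]:
    """Compute a few simple fluency features from time-aligned words.

    The 0.5 s default threshold for a "long" silence is an assumption made
    for illustration only.
    """
    if not words:
        raise ValueError("at least one aligned word is required")

    # Total response time, from the first word onset to the last word offset.
    total_duration = words[-1].end - words[0].start

    # A silence is the gap between the end of one word and the start of the next.
    gaps = [max(0.0, nxt.start - cur.end) for cur, nxt in zip(words, words[1:])]

    return {
        "words_per_second": len(words) / total_duration if total_duration > 0 else 0.0,
        "silence_duration": sum(gaps),
        "long_silence_duration": sum(g for g in gaps if g >= long_silence_threshold),
    }
```

Calling `fluency_features` on the alignment of one response yields a small feature dictionary, which a conventional system would pass on to feature selection and then to the scoring model.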
The scoring model is a regression model that predicts a score from an input fluency feature vector. In general, the scoring model is implemented by linear regression [2], a support vector machine (SVM), or a Gaussian process. It has recently been reported that a Deep Neural Network (DNN)-based approach improves the performance of spoken English fluency scoring by using a DNN-based acoustic model and confidence features [6, 7].

Although the conventional scoring systems work well, some issues remain. First, fluency features are computed based on expert knowledge. Second, feature extraction and scoring model parameters are optimized separately instead of jointly. The hand-crafted features are proposed based on suggestions from the literature [8] and from experts in test development and in the training of human raters [1, 2]. Therefore, some characteristics that are embedded in the raw data corpus can be missed. In addition, optimizing the model parameters separately can lead to suboptimal performance. To address these issues, we propose a Convolutional Neural Network (CNN)-based approach that learns fluency features directly from the raw data corpus and optimizes all model parameters jointly under the same criterion. CNN-based approaches are well known in the ASR area [9, 10, 11]. Our contribution is to investigate the feasibility of a CNN-based approach to the spoken English fluency scoring problem.

The rest of this paper is organized as follows. In Section 2, we briefly describe the system used to score spoken English fluency. In Section 3, we describe general CNN components. In Section 4, we present our approach in detail. Section 5 describes the experiments and results. Sections 6 and 7 present the discussion and conclusions, respectively.
SPOKEN ENGLISH FLUENCY SCORING SYSTEM

In this section, we briefly describe a spoken English fluency scoring system. Figure 1 shows a general fluency scoring system. An input raw speech signal is converted into a sequence of features such as Mel Frequency Cepstral Coefficients (MFCCs), and an ASR system generates time-aligned word sequences. Fluency features are computed based on the ASR output and expert knowledge. A scoring model is trained to map these features to scores and is then used to predict the score of a new input. In this work, we focus on replacing the fluency feature extraction and the scoring model with a CNN, and we investigate the feasibility of the proposed method.
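As an illustration of the conventional scoring-model stage described above, the sketch below fits a linear regression from fluency feature vectors to scores and then predicts a score for a new vector. It is a minimal, dependency-free example using ordinary least squares via the normal equations; the function names `fit_linear_scorer` and `predict_score` are hypothetical, and a real system would more likely rely on an established regression library.

```python
from typing import List, Sequence


def fit_linear_scorer(features: Sequence[Sequence[float]],
                      scores: Sequence[float]) -> List[float]:
    """Fit score = w0 + w1*x1 + ... by ordinary least squares.

    Solves the normal equations (X^T X) w = X^T y with Gauss-Jordan
    elimination. Returns [w0, w1, ...], where w0 is the bias term.
    """
    rows = [[1.0] + list(f) for f in features]  # prepend a bias column
    n = len(rows[0])

    # Augmented matrix [X^T X | X^T y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, scores))]
        for i in range(n)
    ]

    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda k: abs(a[k][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent; cannot fit")
        a[col], a[pivot] = a[pivot], a[col]

        # Normalize the pivot row, then eliminate the column elsewhere.
        pivot_value = a[col][col]
        a[col] = [v / pivot_value for v in a[col]]
        for k in range(n):
            if k != col:
                factor = a[k][col]
                a[k] = [vk - factor * vc for vk, vc in zip(a[k], a[col])]

    return [a[i][n] for i in range(n)]


def predict_score(weights: Sequence[float],
                  feature_vector: Sequence[float]) -> float:
    """Predict a fluency score for one feature vector."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], feature_vector))
```

In this arrangement the feature extractor and the regression weights are fitted in separate steps, which is the separation that the proposed CNN-based approach is intended to remove.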