1. System overview
The proposed system consists of three parts: pitch accent prediction, pitch accent detection, and feedback (Fig. 3). The prediction part extracts syntactic and lexical features from a given sentence and generates a pitch accent pattern that serves as a reference for the learner. The detection part uses the learner's utterance together with the given sentence to identify which words in the sentence are accented. The feedback part provides the learner with feedback on each word in each sentence.
2. Pitch accent prediction part
The pitch accent prediction model is trained on the BU corpus so that the patterns it produces are close to those of native English speakers. Much of the previous research on pitch accents has been conducted on the BU corpus, which is annotated with ToBI labels. Comparable studies based on this corpus report similarly strong performance and commonly adopt machine-learning methods. The prediction model adopts a Conditional Random Field (CRF) classifier. Under the CRF framework, the presence or absence of a pitch accent on each word is treated as a label, and the data for classification are represented as feature vectors. Because the prediction model must predict a pitch accent pattern from the sentence alone, it relies on syntactic and lexical features derived from that sentence. The sentence is first analyzed by a POS tagger with an accuracy of 96.3%, and the syntactic and lexical features are then extracted from the tagged words. For pitch accent prediction, the following syntactic and lexical features are utilized:
- Word identity
- POS tag
- Word class (function word/content word)
- Number of syllables
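The feature set listed above can be illustrated with a minimal Python sketch that turns a POS-tagged sentence into one feature dictionary per word, the usual input shape for a CRF toolkit. Several details are assumptions made only for illustration and are not taken from the proposed system: the Penn Treebank-style tag set in `FUNCTION_POS`, the vowel-group heuristic in `count_syllables`, and the function and key names themselves.

```python
import re

# Assumption: Penn Treebank-style tags that are treated as function words.
# The tag set actually used by the system is not specified in the text.
FUNCTION_POS = {
    "DT", "IN", "CC", "TO", "PRP", "PRP$", "MD",
    "WDT", "WP", "WP$", "WRB", "EX", "RP", "PDT", "POS",
}


def count_syllables(word):
    """Rough heuristic: count runs of consecutive vowel letters (minimum 1).

    A pronunciation dictionary would be more accurate; this is only a stand-in.
    """
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def word_features(tagged_sentence):
    """Map a list of (word, pos_tag) pairs to one feature dict per word."""
    features = []
    for word, pos in tagged_sentence:
        features.append({
            "word": word.lower(),                      # word identity
            "pos": pos,                                # POS tag
            "word_class": ("function" if pos in FUNCTION_POS
                           else "content"),            # word class
            "n_syllables": count_syllables(word),      # number of syllables
        })
    return features


if __name__ == "__main__":
    for feats in word_features([("For", "IN"), ("example", "NN")]):
        print(feats)
```

Each dictionary corresponds to one position in the label sequence, so a sentence becomes a list of dictionaries paired with a list of accented/unaccented labels.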
3. Pitch accent detection part
The pitch accent detection model is trained on the KLEAC corpus to adapt it to a Korean accentuation style, whereas the prediction model is trained on the BU corpus to provide pitch accent patterns that are close to those of native English speakers. Although the two models are trained on different corpora, the detection model adopts the CRF classifier in the same way as the prediction model. To detect which words are accented in a given sentence, the detection model requires acoustic features such as syllable duration, pitch, intensity, and Mel-Frequency Cepstral Coefficients (MFCCs). To obtain the time boundaries of each word, the word list of the given sentence is aligned with the utterance by forced alignment. Based on these boundaries, the acoustic features are extracted from acoustic parameters measured at the middle of each vowel, where the parameters are considered to be relatively stable. According to recent studies, syntactic and lexical features are needed in conjunction with acoustic features to achieve a high detection accuracy. The proposed system therefore also uses syntactic and lexical features to improve the accuracy of the detection model. For pitch accent detection, the following features are utilized:
- Normalized duration of the syllable
- Normalized duration of the vowel in the syllable
- MFCCs of the syllable
- Normalized pitch mean of the syllable
- Normalized intensity mean of the syllable
- Silence before and after the word
- The same syntactic and lexical features as in the prediction model
Some of the features are normalized using the z-score, which is calculated from the mean μ and standard deviation σ of the feature value x as follows: z = (x – μ) / σ. Because the CRF model cannot handle continuous feature variables, these variables are discretized into quantile bins.
The quantile discretization assigns each value in a dataset to a bin, where each bin receives an equal number of data values. This procedure alleviates the data sparseness problem and prevents features in greater numeric ranges from dominating those in smaller numeric ranges.
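The normalization and discretization steps described above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration under stated assumptions, not the system's implementation: the population standard deviation is used for σ, the cut points are taken directly from the sorted training values, and the function names and the number of bins are hypothetical.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def z_scores(values):
    """Return z = (x - mu) / sigma for every x in values.

    Assumption: sigma is the population standard deviation.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical; no spread to normalize.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


def quantile_cut_points(values, n_bins):
    """Choose n_bins - 1 cut points so each bin receives about the same
    number of training values (equal-frequency binning)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]


def discretize(x, cut_points):
    """Return the index of the bin that x falls into (0 .. n_bins - 1)."""
    return bisect_right(cut_points, x)


if __name__ == "__main__":
    durations = [0.08, 0.11, 0.12, 0.15, 0.19, 0.22, 0.27, 0.35]
    normalized = z_scores(durations)
    cuts = quantile_cut_points(normalized, n_bins=4)
    print([discretize(z, cuts) for z in normalized])
```

Computing the cut points once on the training data and reusing them at detection time keeps the bin index of a given value stable, so the discrete feature means the same thing for the classifier in both phases. Tied values always fall into the same bin, which can make the bins slightly unequal in size.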
4. Feedback part
To provide corrective feedback to learners, the predicted and detected pitch accent patterns are compared. As shown in Fig. 4, each word receives positive feedback (the sign “O”) or negative feedback (the sign “X”) depending on whether it is correctly or incorrectly accented. However, if the system is not confident in judging the correctness of an accent, it provides no feedback, in order to avoid giving incorrect feedback. Note that the color of each word represents the degree of accent: the closer the color is to red, the greater the degree of stress. For example, in Fig. 4, the proposed system provides negative feedback for the word “For”, which is predicted to be unaccented (black) but is detected as accented (red) in the given utterance.
The result of the comparison is thus categorized into one of three groups: positive feedback (the sign “O”), negative feedback (the sign “X”), and empty feedback (no sign). For educational purposes, positive feedback helps to motivate learning and negative feedback helps learners to correct their mistakes. Incorrect feedback, such as false positive and false negative feedback, however, adversely affects both the reliability of the learning system and the motivation of learners. Therefore, if the comparison result is not trustworthy, no feedback is provided. To measure the confidence of the feedback, an adjusted score is defined from the output probabilities of the CRF classifiers for each stress label. The adjusted score is the absolute difference between the probabilities of the predicted and detected pitch accent; because it lies between 0 and 1, it can be treated as a probability.
For example, for a given word, suppose that the output probability of the predicted stress is 0.8 (likely to be accented) and that of the detected stress is 0.2 (likely to be unaccented). Then, the adjusted score for the confidence of feedback is 0.6 (= |0.8 – 0.2|). The feedback group is selected according to this adjusted score, |πpre – πdet|, where πpre and πdet are the output probabilities of the predicted and detected stress, respectively, and θ1 and θ2 are the decision boundaries of the feedback groups.
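The adjusted score and a thresholded decision of this kind can be sketched in Python as follows. The mapping from score to feedback group is an assumption made only for illustration: a score of at most θ1 is taken to mean that prediction and detection agree (positive feedback), a score of at least θ2 that they disagree (negative feedback), and anything in between yields no feedback. The default boundary values 0.3 and 0.5 and the function names are likewise hypothetical and are not the values used in the proposed system.

```python
def adjusted_score(p_pred, p_det):
    """Absolute difference between the predicted and detected accent
    probabilities; always between 0 and 1 for valid probabilities."""
    return abs(p_pred - p_det)


def feedback(p_pred, p_det, theta1=0.3, theta2=0.5):
    """Return "O" (positive), "X" (negative), or "" (no feedback).

    Assumed rule, for illustration only:
      score <= theta1  -> patterns agree      -> "O"
      score >= theta2  -> patterns disagree   -> "X"
      otherwise        -> not confident       -> ""
    """
    if not 0.0 <= theta1 <= theta2 <= 1.0:
        raise ValueError("expected 0 <= theta1 <= theta2 <= 1")
    score = adjusted_score(p_pred, p_det)
    if score <= theta1:
        return "O"
    if score >= theta2:
        return "X"
    return ""


if __name__ == "__main__":
    # Worked example from the text: 0.8 predicted, 0.2 detected.
    print(adjusted_score(0.8, 0.2))
    print(repr(feedback(0.8, 0.2)))
```

Widening the gap between the two boundaries trades coverage for reliability: more words receive no sign, but fewer receive an incorrect one.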