Our English query exemplars have been tagged by the BBN IdentiFinder (Bikel et al., 1997) system for named entities. The tagged units which are not found in our translation dictionary are processed by our transliteration system. In the following, we provide a description for every module in Figure 2.

2.1 Detect Chinese Names

The first step in our process is to detect romanized Chinese names. These may be in the (commonly used) Wade-Giles or pinyin conventions. We have extracted the two syllable inventories from the Internet, as well as the mapping from Wade-Giles to pinyin. Detection of romanized Chinese names is achieved by a left-to-right maximum-matching (greedy) segmentation algorithm. The two syllable lists are used in turn for segmentation, since only one convention will be used at a time. If we can successfully segment the input named entity into a sequence of Chinese syllables, our procedure returns the corresponding pinyin syllable sequence, which can be used for query formulation in retrieval. Otherwise we proceed to the next step.

2.2 Generate English Pronunciations

If the input is not a romanized Chinese name, we attempt to automatically acquire a pronunciation for the foreign name in terms of English phonemes. We begin by looking up the pronunciation lexicon PRONLEX provided by the LDC. If the name is found, this procedure outputs an English phoneme sequence. Otherwise the spelling of the name is passed to our automatic letter-to-phoneme generation process. Our letter-to-phoneme generator applies a set of rules to generate an English pronunciation from the input spelling. This set of letter-to-phoneme rules has been automatically inferred from data by the following process: We used the entire PRONLEX lexicon, which contains 90,000 words, for training. For each word, we aligned the spelling with the pronunciation in a Viterbi-style manner to achieve a one-to-one letter-to-phoneme mapping, e.g. "appraise" is aligned with /ax pp null rr ey null zz null/.
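The left-to-right maximum-matching detection step of Section 2.1 can be sketched as follows. This is a minimal illustration: the function name is ours, and the tiny pinyin inventory stands in for the full inventories (several hundred syllables each) extracted from the Internet.

```python
def segment_romanized_name(name, syllables, max_len=6):
    """Greedy left-to-right maximum-matching segmentation.

    Returns the list of matched syllables, or None if the name
    cannot be fully segmented against the given inventory.
    """
    name = name.lower()
    result, i = [], 0
    while i < len(name):
        # Try the longest candidate first (maximum matching).
        for length in range(min(max_len, len(name) - i), 0, -1):
            candidate = name[i:i + length]
            if candidate in syllables:
                result.append(candidate)
                i += length
                break
        else:
            return None  # no syllable matches at position i
    return result

# Tiny illustrative pinyin inventory, for demonstration only.
PINYIN = {"mao", "ze", "dong", "li", "peng"}
```

A name that segments fully is accepted as a romanized Chinese name; any name with an unmatchable residue (e.g. "Smith") falls through to the letter-to-phoneme path.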
A /null/ phoneme is inserted when we encounter geminate letters, or in cases where more than one letter maps into a single phoneme. We then apply the transformation-based error-driven learning (TEL) approach (Brill, 1995) to these alignments to obtain a set of transformation rules for spelling-to-pronunciation generation. Referring to Figure 1, these rules were able to generate the pronunciation /kk rr ih ss tt aa ff er/¹ for the input spelling "Christopher".

2.3 Apply Cross-Lingual Phonological Rules

Chinese is monosyllabic in nature, but English is not. Therefore we observe some phonological differences between the two languages. For example, the name Bush is pronounced as a single syllable /bb uh sh/ in English, but transliterated as two syllables in Chinese: /bu shi/. As another example, Clinton /kk ll ih nn tt ih nn/ contains a consonant cluster (/kk ll/), but its Chinese transliteration inserts a syllable nucleus in between the consonants, and is pronounced as /ke lin dun/. We have written a set of phonological rules to transform the English pronunciation, in an attempt to bridge some of the discrepancies mentioned above. This serves to ease the subsequent process of cross-lingual phonetic mapping (CLPM). Examples of these rules include:

- Insert a reduced syllable nucleus (the 'schwa' /ax/) between clustered consonants. This takes care of pronunciations as in the example Clinton mentioned earlier.
- Duplicate the nasals /mm/, /nn/ and /nx/ (syllabic nasal) whenever they are surrounded by vowels. For example, Diana, pronounced as /dd ay ae nn ax/ in English, is often transliterated as /dai an na/ in Chinese, where the nasal /nn/ forms part of the syllable final in the second syllable, as well as the onset of the third syllable.
- For all consonant endings, except /ll/, append a syllable nucleus. For example, Bennett, pronounced as /bb eh nn ih tt/ in English, is often transliterated as /bei nei te/ in Chinese.
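The three phonological rules above can be sketched as follows. This is a simplified rendering under stated assumptions: the phoneme inventories are abbreviated, the cluster rule is applied without the contextual restrictions a real rule set would carry, and the choice of /ax/ as the appended final nucleus is our placeholder.

```python
VOWELS = {"aa", "ae", "ah", "ax", "ay", "eh", "er",
          "ih", "iy", "ow", "oy", "uh", "uw"}
NASALS = {"mm", "nn", "nx"}

def apply_phonological_rules(phones, schwa="ax"):
    """Sketch of the cross-lingual phonological rules (simplified)."""
    out = []
    n = len(phones)
    for i, p in enumerate(phones):
        # Rule 1: insert a reduced nucleus between clustered consonants.
        if out and out[-1] not in VOWELS and p not in VOWELS:
            out.append(schwa)
        out.append(p)
        # Rule 2: duplicate a nasal that is surrounded by vowels.
        if (p in NASALS and 0 < i < n - 1
                and phones[i - 1] in VOWELS and phones[i + 1] in VOWELS):
            out.append(p)
    # Rule 3: append a nucleus after a final consonant, except /ll/.
    if out and out[-1] not in VOWELS and out[-1] != "ll":
        out.append(schwa)
    return out
```

For Diana, /dd ay ae nn ax/, the nasal is duplicated so that /nn/ can close the second syllable and open the third, mirroring /dai an na/; a final /ll/, as in Bell, is deliberately left untouched for the separate treatment described next.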
If the syllable ends with /ll/, it is treated differently: consider the example Bell, pronounced as /bb eh ll/ in English, and often transliterated as /bei er/ in Chinese.

2.5 Cross-lingual Phonetic Mapping (CLPM)

This procedure aims to map the English phonemes into Chinese "phonemes" (derived from syllable initials and finals) by applying a set of transformation rules. Again, these rules are learnt automatically from data by the technique of transformation-based error-driven learning (TEL). The process is as follows: We collected a bilingual proper name list which contains English proper names with their Chinese transliterations. Our list is derived from LDC's English-Chinese bilingual term list with CETA (Chinese-English Translation Assistance), a list from the National Taiwan University², and some name pairs harvested from the Internet. We randomly allocated training and test sets, with 2233 and 1541 names respectively. Each name pair contains the English name and corresponding Chinese translation / transliteration. We looked up the English name pronunciation from PRONLEX, and the Chinese pronunciation from LDC's Mandarin CALLHOME lexicon. We obtained a one-to-one phoneme-to-phoneme alignment between the English name pronunciation and the Chinese name pronunciation by means of a finite-state transducer (FST) (Mohri et al., 1998). The FST was initialized with some obvious English-phoneme-to-Chinese-phoneme correspondences, and trained iteratively on a set of phoneme pairs until convergence is reached. The converged FST is used to align our training words, and then we applied TEL to derive a set of transformation rules to map English phonemes into Chinese phonemes. Given a test English phoneme sequence, application of our transformation rules will generate a single Chinese phoneme sequence.

¹ The /null/ phoneme has been discarded in the generated output.
² This list is provided by H. H. Chen from National Taiwan University.
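Application of the mapping rules at test time can be sketched as an ordered, longest-match-first rewrite over the phoneme sequence. The rule list below is purely illustrative; the actual rules are induced by TEL from the FST-aligned training pairs and may rewrite multi-phoneme patterns with context.

```python
# Each rule: (english_phoneme_pattern, chinese_replacement).
# Illustrative mappings only, not the learned rule set.
RULES = [
    (("kk",), ("k",)),
    (("ll",), ("l",)),
    (("tt",), ("t",)),
    (("ih",), ("i",)),
    (("ax",), ("e",)),
    (("nn",), ("n",)),
]

def map_phonemes(phones, rules=RULES):
    """Apply ordered rewrite rules left to right, longest match first."""
    by_len = sorted(rules, key=lambda r: -len(r[0]))
    out, i = [], 0
    while i < len(phones):
        for pattern, replacement in by_len:
            if tuple(phones[i:i + len(pattern)]) == pattern:
                out.extend(replacement)
                i += len(pattern)
                break
        else:
            out.append(phones[i])  # pass unmapped phonemes through
            i += 1
    return out
```

Because the matcher is deterministic, the output is a single Chinese phoneme sequence, as stated above; alternatives are only reintroduced later via the confusion matrix.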
2.6 Generate a Chinese Phoneme Lattice

Based on an English phoneme sequence, CLPM generates a single Chinese phoneme sequence as output. We need to apply Chinese syllabic constraints to this phoneme sequence to produce a syllable sequence (in pinyin). However, this Chinese phoneme sequence may contain errors. In order to include phoneme alternatives prior to syllabification, we try to capture common confusions in CLPM. To do this, we applied our transformation rules to each English pronunciation in the training set, and compared the generated Chinese phoneme sequence with the reference sequence to produce a confusion matrix. The matrix stores the frequency of confusion for each reference-phoneme/output-phoneme pair. Upon testing, the confusion matrix is used to generate a phoneme lattice prior to syllabification. A phoneme lattice is illustrated in Figure 3. Given an English name (Cecil Taylor) and its English pronunciation (note that this is an overgeneralization because not all names are of English origin, but we treat them as such for the sake of simplicity in letter-to-phoneme generation), we applied CLPM to give a corresponding Chinese phoneme string /s a x e er t ai l e/ (first row of nodes). For each Chinese phoneme in this string, we expand with all its confusable alternatives by referencing the confusion matrix. For example, the first Chinese phoneme /s/ has been confused with other phonemes (such as /k/), and these are inserted to form a lattice. Similarly, the second phoneme /a/ has been confused with /ai/, which gets inserted as well. The inserted nodes in the lattice are also weighted by their probability of confusion, derived from the statistics in the confusion matrix. The expanded nodes serve to provide alternative phonemes for syllabification.

2.7 Search Syllable Graph with a Syllable Bigram

We search our phoneme lattice exhaustively for Chinese phoneme sequences which can constitute legitimate syllables, to create a syllable graph (see Figure 4).
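The confusion-matrix bookkeeping and the lattice expansion can be sketched as below, assuming the reference and generated phoneme sequences are already aligned one-to-one; the function names are ours.

```python
from collections import defaultdict

def build_confusion_matrix(aligned_pairs):
    """Count how often each output phoneme was produced in place of
    each reference phoneme, over one-to-one aligned sequence pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for ref_seq, out_seq in aligned_pairs:
        for ref, out in zip(ref_seq, out_seq):
            counts[out][ref] += 1
    return counts

def expand_to_lattice(output_phones, counts):
    """For each output phoneme, return it together with its confusable
    alternatives, weighted by their relative frequency of confusion."""
    lattice = []
    for p in output_phones:
        total = sum(counts[p].values()) or 1
        alternatives = {ref: c / total for ref, c in counts[p].items()}
        alternatives.setdefault(p, 1.0)  # always keep the hypothesis itself
        lattice.append(alternatives)
    return lattice
```

Each position in the returned lattice is a weighted set of candidate phonemes, which is exactly what the syllabification search of Section 2.7 consumes.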
We then traverse the graph by A* search to find the N most probable syllable sequences. Probabilities derived from the confusion matrix, as well as those from a syllable bigram language model, are considered. The syllable bigram language model is trained from a list of 3,628 Chinese names harvested from the Internet. This configuration is capable of hypothesizing N-best syllable sequences; we currently set N=1 for the sake of simplicity. The idea behind this step and the previous one is inspired by lexical access in speech recognition, which produces word hypotheses from a lattice of recognized phones. Indeed, if we use a character bigram instead of the syllable bigram during A* search, we can potentially generate an N-best list of character sequences, e.g. generating a Chinese character sequence for Christopher whose pronunciation is /ji li si te fu/. Based on our test set of 1541 names, this procedure gave a transliterated syllable accuracy of about 47.5%.
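Since N=1, the search reduces to picking the single best path through the syllable graph. The scoring that combines confusion-matrix weights with syllable bigram probabilities can be sketched as follows; the A* machinery and graph construction are omitted, and the names, the start symbol, and the smoothing floor are our assumptions.

```python
import math

def best_sequence(candidates, bigram, node_score, floor=1e-6):
    """Pick the highest-scoring syllable sequence among candidates.

    bigram[(prev, cur)] approximates P(cur | prev); "<s>" marks the
    sequence start. node_score holds per-syllable confusion weights.
    """
    def score(seq):
        s, prev = 0.0, "<s>"
        for syl in seq:
            s += math.log(node_score.get(syl, floor))   # lattice weight
            s += math.log(bigram.get((prev, syl), floor))  # bigram LM
            prev = syl
        return s
    return max(candidates, key=score)
```

Swapping in a character bigram over character sequences, as the text notes, requires no change to this scoring scheme, only different units and a different language model.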
