We have developed an English-Chinese cross-language spoken document retrieval (CL-SDR) system, in which English textual queries are used to retrieve Mandarin spoken documents, i.e. a cross-language and cross-media information retrieval task. With the growing multi-media and multi-lingual content in the global information infrastructure, CL-SDR technologies are potentially very powerful, as they enable the user to search for personally relevant audio content (e.g. recordings of meetings, lectures or radio broadcasts) across the barriers of language and media. Our system accepts an entire English textual story (from newspapers) as the input query, and automatically retrieves relevant Mandarin audio stories (from radio broadcasts). We refer to the English story as our query exemplar, and to this retrieval context as query-by-example. Our task is illustrated in Figure 1. Mandarin is the key dialect of Chinese. English and Chinese are two predominant languages used by the global population. They are very different linguistically, hence English-Chinese CL-SDR presents unique research challenges. A prevailing problem in our task is that the topically diverse news domain contains many named entities, and these are often out-of-vocabulary (OOV) words in recognition and translation. In word recognition for audio indexing, OOV words (words unknown to the speech recognizer) may be erroneously substituted by other in-vocabulary words. Our solution to this problem is to use syllable recognition, where the OOV word is transcribed as its constituent syllables. This is feasible because a compact inventory of approximately 400 base syllables can provide full phonological coverage for the Chinese language. Additionally, a syllable forms the pronunciation of a Chinese character, and the mapping between syllables and characters is many-to-many. An inventory of approximately 6,000 characters provides full textual coverage in Chinese.
However, a Chinese word may consist of one or more characters, hence character combinations can produce an unlimited number of Chinese words. There is no explicit word delimiter, and the task of segmenting a character sequence into a word sequence contains much ambiguity. Consequently, we have augmented word-based retrieval with character- and syllable-based retrieval. We use overlapping character/syllable n-grams to circumvent the problem of tokenization ambiguity. Character/syllable bigrams fare best among n-grams in retrieval performance, and character bigrams outperform words, based on our experiments with the Topic Detection and Tracking (TDT) Collection from the LDC.
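The overlapping n-gram indexing units described above can be illustrated with a minimal sketch. The function name `overlapping_ngrams`, the sample character string, and its pinyin syllables are assumptions chosen for illustration; they are not taken from the system itself.

```python
def overlapping_ngrams(tokens, n=2):
    """Return every overlapping n-gram of a token sequence, in order.

    `tokens` may be a list of Chinese characters or of syllables.
    Because each position starts a new n-gram, no word segmentation
    is needed beforehand.
    """
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Illustrative input (an assumption, not from the TDT data).
chars = list("香港中文大学")
char_bigrams = ["".join(gram) for gram in overlapping_ngrams(chars, 2)]

# The same phrase as a syllable sequence (tones omitted).
syllables = ["xiang", "gang", "zhong", "wen", "da", "xue"]
syllable_bigrams = ["_".join(gram) for gram in overlapping_ngrams(syllables, 2)]

print(char_bigrams)      # ['香港', '港中', '中文', '文大', '大学']
print(syllable_bigrams)  # ['xiang_gang', 'gang_zhong', 'zhong_wen', 'wen_da', 'da_xue']
```

Some of the resulting bigrams straddle word boundaries (e.g. 港中), but the true words are always among the units generated, which is why this scheme sidesteps segmentation ambiguity.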
We have incorporated the automatic name transliteration procedure into our task of English-Chinese CL-SDR. The experiment was based on the TDT Collection. Query exemplars were drawn from English news text (from the New York Times and Associated Press). Audio documents were drawn from Voice of America news broadcasts in Mandarin. The TDT collection has manual, exhaustive topic annotations that serve as relevance judgements for retrieval. There are 17 topics in total in the collection, and we included up to 12 query exemplars for each topic in our retrieval experiments. Retrieval performance is measured by non-interpolated mean average precision (mAP). As mentioned earlier, we used both words and character bigrams for retrieval, and the latter outperform the former, as shown in the TDT-2 results in Table 2. We extracted the 200 most common named entities that have been tagged in our query exemplars (by the BBN IdentiFinder). These are processed by our named entity transliteration procedure, and the output syllable sequences are used to augment the translated Chinese query. From Table 2 we see that named entity transliteration brought about small but consistent improvements to both word-based and character-based retrieval. The improvement is not statistically significant, though we believe this is due to the limited number of names that have been transliterated. This is an ongoing research effort, and we plan to further investigate ways to enhance retrieval performance by handling OOV words via transliteration.
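For readers unfamiliar with the evaluation measure, the following is a minimal sketch of non-interpolated mean average precision as it is commonly defined. The function names and the document identifiers are hypothetical, and this is not the evaluation code used in the experiments.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for one query.

    Precision is sampled at the rank of each relevant document that
    is retrieved, and the sum is divided by the total number of
    relevant documents, so unretrieved ones contribute zero.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Average the per-query scores over (ranked_ids, relevant_ids) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical ranked lists for two query exemplars.
runs = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),  # AP = (1/2 + 2/4) / 2 = 0.5
    (["d5", "d6"], {"d5"}),                    # AP = 1.0
]
map_score = mean_average_precision(runs)
print(map_score)  # 0.75
```

In the setting described above, each query exemplar would yield one ranked list of audio stories, and the topic annotations would supply the set of relevant stories.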