CALLHOME Mandarin Chinese Transcripts - XML version

Item Name: CALLHOME Mandarin Chinese Transcripts - XML version
Author(s): Tony McEnery, Richard Xiao
LDC Catalog No.: LDC2008T17
ISBN: 1-58563-485-9
ISLRN: 741-988-462-570-4
Release Date: September 15, 2008
Member Year(s): 2008
DCMI Type(s): Text
Data Source(s): telephone conversations
Application(s): tagging, spoken dialogue modeling, speech recognition, language modeling
Language(s): Mandarin Chinese
Language ID(s): cmn
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2008T17 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: McEnery, Tony, and Richard Xiao. CALLHOME Mandarin Chinese Transcripts - XML version LDC2008T17. Web Download. Philadelphia: Linguistic Data Consortium, 2008.


CALLHOME Mandarin Chinese Transcripts - XML Version, Linguistic Data Consortium (LDC) catalog number LDC2008T17 and ISBN 1-58563-485-9, was developed at Lancaster University, United Kingdom.

LDC's CALLHOME Mandarin Chinese collection includes telephone speech, associated transcripts and a lexicon. CALLHOME Mandarin Chinese Speech consists of 120 unscripted telephone conversations between native speakers of Mandarin Chinese. All calls, which lasted up to thirty minutes, originated in North America and were placed to locations overseas; most participants called family members or close friends. CALLHOME Mandarin Chinese Transcripts covers a contiguous five- or ten-minute segment from each of the telephone speech files. The transcripts are in tab-delimited format with GB2312 encoding, are timestamped by speaker turn for alignment with the speech signal and are provided in standard orthography. CALLHOME Mandarin Chinese Lexicon comprises over 40,000 words from twenty CALLHOME Mandarin transcripts.
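As an illustration of what working with a GB2312-encoded, tab-delimited transcript line might involve, the following is a minimal Python sketch. The field layout (start time, end time, speaker label, text) and the function names are assumptions made for this example only; they are not taken from the corpus documentation, which should be consulted for the actual layout.

```python
# Minimal sketch, not the documented CALLHOME layout: the four-field order
# (start, end, speaker, text) is an assumption made for illustration.

def parse_turn(raw: bytes) -> dict:
    """Decode one GB2312-encoded, tab-delimited line into a turn record."""
    line = raw.decode("gb2312").rstrip("\r\n")
    # Split on at most three tabs so that any tab inside the text is kept.
    start, end, speaker, text = line.split("\t", 3)
    return {
        "start": float(start),
        "end": float(end),
        "speaker": speaker,
        "text": text,
    }


def to_utf8(raw: bytes) -> bytes:
    """Re-encode GB2312 bytes as UTF-8 without changing the content."""
    return raw.decode("gb2312").encode("utf-8")


# Hypothetical sample line, built here rather than copied from the corpus.
sample = "12.50\t15.75\tA\t你好".encode("gb2312")
turn = parse_turn(sample)
print(turn["speaker"], turn["start"], turn["end"], turn["text"])
```

Decoding to a Unicode string first and only then splitting avoids cutting a two-byte GB2312 character in half.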

CALLHOME Mandarin Chinese Transcripts - XML Version, the latest addition to this collection, presents the entire original corpus of 120 transcripts in XML format with UTF-8 encoding, retokenization and part-of-speech (POS) tagging. The retokenization and POS information were supplied using the Chinese Lexical Analysis System (ICTCLAS) developed by the Institute of Computing Technology, Chinese Academy of Sciences, Beijing. ICTCLAS aims to incorporate Chinese word segmentation, POS tagging, disambiguation and unknown-word recognition into a single theoretical framework using multi-layered hierarchical hidden Markov models.

In addition to the original applications for Mandarin Chinese CALLHOME data (e.g., speech recognition), CALLHOME Mandarin Chinese Transcripts - XML Version will be useful in the grammatical study of spoken Mandarin.


This XML corpus retains all of the linguistic analyses (e.g., timestamps, spoken features and proper nouns) from the original transcripts release, but the mnemonics used in the original release were migrated into XML markup following the mapping rules described below:

All analyses in the original release were retained at the expense of tokenization and part-of-speech tagging accuracy (e.g., some mnemonics encoding spoken features may split a word, which can affect the tagging accuracy). However, the results of the automated processing were substantially post-edited. For example, four aspect markers in Chinese (-le, -guo, -zhe and zai) were disambiguated and corrected by hand, and all of the classifiers (also called "measure words") were re-tagged using a more fine-grained annotation scheme developed on the Lancaster University project. In addition, a large number of obvious typographical errors in the original release were corrected in the process of post-editing.

Number of unique words: 6,895
Total number of words: 300,767
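For readers who want to compute comparable figures on their own tokenized text, the following is a minimal Python sketch of counting unique words (types) and total words (tokens). The function name and the sample token list are hypothetical, and the sketch assumes the text has already been segmented into words; it is not the procedure used to produce the figures above.

```python
from collections import Counter


def type_token_counts(tokens):
    """Return (number of unique words, total number of words)."""
    counts = Counter(tokens)
    return len(counts), sum(counts.values())


# Hypothetical, already-segmented sample; not taken from the corpus.
tokens = ["我", "去", "了", "学校", "了"]
types, total = type_token_counts(tokens)
print(types, total)  # 4 5
```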

