Mandarin Chinese Conversational Telephone Speech & Transcripts, PART 1
(Speech Corpus: LDC2005S15; Transcript Corpus: LDC2005T32)
1. Summary
In 2004, the Hong Kong University of Science and Technology (HKUST)
was contracted to collect and transcribe 200 hours of Mandarin Chinese
conversational telephone speech from Mandarin speakers in mainland
China under the DARPA EARS framework. The first 50 hours of speech and
transcripts were released in June, 2004 to the EARS community for the
RT-04 NIST evaluation. NIST partitioned the remaining 150 hours of
collection into training, development and evaluation sets. This
release contains the training and development sets with 873 and 24
calls, respectively.
2. Data collection
Subject recruitment was done in several cities across mainland
China. Most subjects did not previously know each other. To encourage
more meaningful conversation, topics similar to those in Fisher
English were designed. All calls were operator-assisted: an
operator called the two participants at the scheduled time to initiate
the call. Subjects were asked demographic questions before they were
bridged for normal conversation. Their answers to the demographic
questions were recorded in separate files.
Subjects were allowed to talk for up to 10 minutes, and with a few
exceptions the calls run to the maximum length. Although subjects were
allowed to make up to 3 calls, every subject in this release made just
one call, with one exception: PIN 10683 and PIN 10686 belong to a
single individual.
Each side of a call was recorded in a separate wav file, sampled at
8 kHz with 8 bits per sample (a-law encoded). The two sides were later
multiplexed into a single file in SPHERE format with the a-law encoding
preserved. Where one side was shorter than the other, the shorter side
was padded with silence. In the release, the file name of each recorded
call has the form "date_time_Apin_Bpin.sph", and the corresponding
transcript has the same name with a .txt extension.
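The naming and multiplexing conventions above can be sketched in Python
as follows. This is only an illustration, not the tool used to build
the release. It assumes that the "A" and "B" in the file name are
literal side letters preceding each PIN, that padding is appended at
the end of the shorter side, and that silence is the a-law code 0xD5
(a zero-amplitude sample). The function names are hypothetical, and the
output is raw interleaved samples with no SPHERE header.

```python
from pathlib import Path

# Assumption: silence is the a-law code for a zero-amplitude sample.
ALAW_SILENCE = 0xD5


def parse_call_name(filename):
    """Split a 'date_time_Apin_Bpin.sph' name into its four fields."""
    stem = Path(filename).stem
    date, time, a_field, b_field = stem.split("_")
    # Assumption: each PIN field starts with a literal side letter.
    pin_a = a_field[1:] if a_field.startswith("A") else a_field
    pin_b = b_field[1:] if b_field.startswith("B") else b_field
    return {"date": date, "time": time, "pin_a": pin_a, "pin_b": pin_b}


def multiplex(side_a, side_b):
    """Interleave two 8-bit a-law byte streams into one 2-channel stream.

    The shorter side is padded at the end with silence so that both
    channels have the same number of samples.
    """
    n = max(len(side_a), len(side_b))
    pad = bytes([ALAW_SILENCE])
    a = side_a.ljust(n, pad)
    b = side_b.ljust(n, pad)
    out = bytearray(2 * n)
    out[0::2] = a  # channel 1: side A
    out[1::2] = b  # channel 2: side B
    return bytes(out)
```

The date, time and PIN values in any test call to parse_call_name are
placeholders; the exact width and format of those fields are not
specified here.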
3. Speaker demographics
Subjects were asked to provide several pieces of demographic
information, including gender, age, native language/dialect,
birthplace, education, occupation, phone type, etc. Standard Mandarin
is the official language of education, but it is not the native dialect
in many regions of China, and speakers may or may not have a regional
accent when speaking it. It was therefore decided to divide subjects'
birthplaces into Mandarin-dominant and non-Mandarin-dominant regions,
and to audit all calls and classify them as either standard or
accented, without further distinctions.
Selected demographics - age, gender, birthplace, phone type and accent
for each side of the call and the topic ID for the call - are provided
as a tab-delimited, plain-text, tabular file.
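A tab-delimited table of this kind can be read with Python's csv
module. The sketch below makes no assumption about column order or
about whether the file has a header row; it simply yields the fields of
each non-empty row. The function name is hypothetical.

```python
import csv


def read_demographics(lines):
    """Yield one list of fields per non-empty row of a tab-delimited table.

    `lines` is any iterable of text lines, e.g. an open file object.
    """
    # QUOTE_NONE: treat quote characters as ordinary text, since the
    # table is described as plain text rather than quoted CSV.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # skip blank lines
            yield row
```

When reading from disk, open the file with newline="" as the csv module
recommends, and with whatever text encoding the table actually uses,
which is not stated in this section.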
4. Transcription
All calls were fully transcribed from the beginning to the
end. Standard simplified Chinese characters, encoded in GBK (CP-936),
were used. Speech is segmented at natural boundaries wherever possible
and each segment is no more than 10 seconds long. HKUST formulated
transcription guidelines based on LDC's RT-03 transcription
guidelines. For more information, refer to "trans-guidelines.pdf"
included in the release.
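As a rough illustration of the two constraints above, the following
Python sketch decodes GBK-encoded transcript bytes and flags segments
longer than 10 seconds. It assumes the segment times have already been
extracted as (start, end) pairs in seconds; parsing them out of a
transcript file is not shown, and the function names are hypothetical.

```python
def decode_transcript(raw):
    """Decode GBK (CP-936) transcript bytes into a Python string."""
    return raw.decode("gbk")


def overlong_segments(segments, limit=10.0):
    """Return the (start, end) pairs whose duration exceeds `limit` seconds."""
    return [(start, end) for start, end in segments if end - start > limit]
```

An empty result from overlong_segments means every segment respects the
10-second limit.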
The transcripts provided by HKUST were XML-formatted with each side of
a call in a separate file. LDC multiplexed the two sides into a single
file with turns interleaved in temporal order (based on the initial
time stamps), and converted the format into the LDC format. All
transcripts were checked against RT-04 formatting standards. The
following is a list of RT-04 conventions that are different from those
in the transcription guidelines.
(1) Speaker noise: curly brackets, e.g. {laugh}, instead of angle brackets;
(2) Foreign language: TEXT
instead of TEXT.
The Chinese text is not segmented into words, though there are
occasional white spaces within some turns.
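The multiplexing of the two transcript sides described above, with
turns interleaved by initial time stamp, can be sketched in Python as
follows. This is not LDC's conversion script. It assumes each turn is a
tuple whose first element is the start time in seconds, and that each
side is already sorted by start time; the function name is
hypothetical.

```python
import heapq


def interleave_turns(side_a, side_b):
    """Merge two per-side turn lists into one list ordered by start time.

    Each turn is a tuple whose first element is its initial time stamp.
    When two turns start at the same time, the turn from side_a comes
    first, because heapq.merge is stable across its inputs.
    """
    return list(heapq.merge(side_a, side_b, key=lambda turn: turn[0]))
```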
5. Directory structure
Speech and transcript data are released as separate corpora. Speech
files are provided on a set of two DVDs under a top-level "audio"
directory on each disc; the transcript files are provided in a single
"tar" file, with "trans" as the top-level directory. In both cases,
the data files are further subdivided into "train" and "dev"
(development test set) directories, according to the balanced test-set
selection process applied to the corpus as a whole by NIST.
Speech discs:
  audio/
    train/ (present on both discs in the set)
    dev/ (present only on disc 2 of the set)
Transcript tar file:
  trans/
    train/ (all 873 training files)
    dev/ (the 24 development-test files)
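Given this layout, the transcript that corresponds to a speech file can
be located by swapping the top-level directory and the extension. The
Python sketch below assumes relative paths of the form
"audio/<set>/<name>.sph" and that the transcript shares the speech
file's base name; the function name is hypothetical.

```python
from pathlib import PurePosixPath


def transcript_path(audio_path):
    """Map 'audio/<set>/<name>.sph' to 'trans/<set>/<name>.txt'."""
    p = PurePosixPath(audio_path)
    parts = p.parts
    if len(parts) != 3 or parts[0] != "audio" or parts[1] not in ("train", "dev"):
        raise ValueError(f"unexpected speech file path: {audio_path}")
    return str(PurePosixPath("trans", parts[1], p.stem + ".txt"))
```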
The tar file and the two DVDs also contain a common "docs" directory,
containing the table of speaker demographics and call information, the
transcription guidelines, the topic descriptions, and this readme file.