Switchboard Cellular Part 1 Transcription
|Item Name:||Switchboard Cellular Part 1 Transcription|
|Author(s):||David Graff, Kevin Walker, David Miller|
|LDC Catalog No.:||LDC2001T14|
|Data Source(s):||telephone conversations|
|Project(s):||SID, GALE, EARS|
|License(s):||LDC User Agreement for Non-Members|
|Online Documentation:||LDC2001T14 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Graff, David, Kevin Walker, and David Miller. Switchboard Cellular Part 1 Transcription LDC2001T14. Web Download. Philadelphia: Linguistic Data Consortium, 2001.|
The Switchboard Cellular Part 1 collection focused primarily on GSM cellular phone technology. Collection began on 11/12/1999 and was completed on 05/15/2000. The project's goal was to recruit 190 subjects, balanced by gender and calling under varied environmental conditions, to participate in ten or more five- to six-minute conversations on GSM cellular phones. The speech data was collected for research, development, and evaluation of automatic systems for speech-to-text conversion, talker identification, language identification, and speech signal detection.
The Switchboard Cellular Part 1 Transcription was produced by the Linguistic Data Consortium (LDC), catalog number LDC2001T14, ISBN 1-58563-214-7. This release contains the 250 transcripts that correspond to the speech data files in Switchboard Cellular Part 1 Transcribed Audio (LDC2001S15), along with documentation describing speaker information (sex, age, education, city and state where raised), call information (date, time, call duration, Personal Identification Numbers, topic), and audit information (channel quality, background noise). Switchboard Cellular Part 1 calls were transcribed using conventions similar to those of HUB5 English.
During the collection period, the LDC collected a total of 1,309 calls, or 2,618 sides (1,957 GSM), from 254 participants (129 male, 125 female) under varied environmental conditions. Of these, 250 calls were transcribed.
Each speech file consists of a 1,024-byte ASCII-formatted Sphere header, followed by two-channel interleaved mu-law sample data. The mu-law samples represent the actual digital data transmission from the telephone service provider (MCI), as captured separately for each side of the telephone conversation by the LDC's telephone collection platform. The header also indicates the caller_pin, callee_pin, topic_id, cellular service/handset information and speaker demographic information.
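The file layout described above can be illustrated with a short Python sketch. It is not part of the LDC release and is only a minimal reading of the general NIST SPHERE layout (a `NIST_1A` magic line, a header-size line, then `name -type value` fields ending with `end_head`) together with standard G.711 mu-law expansion. The function names, the synthetic demo file, and its field values (including the `caller_pin` value) are assumptions made for illustration; real headers may carry other fields, and the mapping of channels to caller and callee is not assumed here.

```python
def parse_sphere_header(data: bytes):
    """Return (fields, header_size) from the ASCII SPHERE header."""
    if not data.startswith(b"NIST_1A\n"):
        raise ValueError("missing NIST_1A magic line")
    # Second 8-byte line holds the header size, e.g. "   1024\n".
    header_size = int(data[8:16].decode("ascii").strip())
    text = data[16:header_size].decode("ascii", errors="replace")
    fields = {}
    for line in text.split("\n"):
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, kind, value = parts
        if kind == "-i":
            fields[name] = int(value)
        elif kind == "-r":
            fields[name] = float(value)
        else:  # "-sN": string of length N
            fields[name] = value
    return fields, header_size


def mulaw_to_linear(byte: int) -> int:
    """Expand one G.711 mu-law byte to a signed linear sample."""
    u = ~byte & 0xFF                 # mu-law bytes are stored inverted
    t = ((u & 0x0F) << 3) + 0x84     # mantissa plus bias
    t <<= (u & 0x70) >> 4            # apply the segment (exponent)
    return (0x84 - t) if (u & 0x80) else (t - 0x84)


def split_channels(data: bytes, header_size: int):
    """De-interleave the two-channel payload and decode each side."""
    payload = data[header_size:]
    chan0 = [mulaw_to_linear(b) for b in payload[0::2]]
    chan1 = [mulaw_to_linear(b) for b in payload[1::2]]
    return chan0, chan1


def build_demo_file() -> bytes:
    """Build a tiny synthetic file; all field values are made up."""
    lines = [
        "channel_count -i 2",
        "sample_rate -i 8000",
        "sample_coding -s4 ulaw",
        "caller_pin -s4 1234",   # hypothetical value
        "end_head",
    ]
    head = ("NIST_1A\n   1024\n" + "\n".join(lines) + "\n").encode("ascii")
    head = head.ljust(1024, b" ")    # pad header to 1,024 bytes
    # Two sample frames: (chan0, chan1), (chan0, chan1)
    return head + bytes([0xFF, 0x00, 0xFF, 0x80])


demo = build_demo_file()
fields, header_size = parse_sphere_header(demo)
side_a, side_b = split_channels(demo, header_size)
print(fields["channel_count"], header_size, side_a, side_b)
```

In practice the bytes would come from reading a speech file in binary mode; slicing with a stride of two is enough to separate the sides because the samples are one byte each and strictly interleaved.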
The documentation also contains reports on clipped files.
The release covers 250 transcribed files, corresponding to roughly 12 hours of audio data (1,431 Mbytes).