README FILE FOR THE EXPANDED CALLFRIEND FARSI TELEPHONE SPEECH CORPUS

LDC Catalog-ID: LDC2014S01

1. Introduction and Background

This corpus contains audio recordings of 100 telephone conversations among native speakers of Farsi. These calls were recorded by the Linguistic Data Consortium in 1995-6 as part of the CallFriend (CF) collection, which was designed primarily to support research in automatic language identification.

One hundred native Farsi speakers living in the continental U.S. were recruited and offered incentives to make a single phone call, lasting up to 30 minutes, to a family member or friend living anywhere else in the U.S.

Audio data for 60 of the calls were released, without transcripts, in the LDC's 1996 membership year (corpus catalog-ID LDC96S50), after 20 of these calls had been used as evaluation data in the first NIST Language Recognition Evaluation (LRE) in 1996; the remaining 40 were held in reserve for use in later LRE test sets, in 2003 and 2005.

All CF recordings involved domestic calls routed through the LDC's call collection platform, and were stored as 2-channel ("4-wire"), 8-kHz mu-law samples taken directly from the public telephone network via a T-1 circuit.

In 2000-1, the LDC employed a small group of Farsi speakers to transcribe the 100 CF Farsi calls, to support research in automatic speech recognition. Those transcripts are being released as a separate corpus, LDC2014T01.

2. Directory Structure

  docs/ -- contains one file:
      call_info.tab -- inventory of calls in the corpus

  data/ -- contains 100 audio data files (fa_####.flac)

In all the data files, the four-digit portion of the file name is a numeric call-ID, used across all forms of data (text and audio) from a given conversation.

3. File Formats

3.1 call_info.tab

This is a plain-text, tab-delimited table containing column headings on the first line in the file, followed by one line (or row) for each call in the corpus.
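As an illustration of the layout just described (a header line followed by one tab-delimited row per call), a table of this kind could be read with a minimal Python sketch like the one below. The function name read_call_info is hypothetical, the sample rows are made up for illustration and are not corpus data, and the sketch assumes that the header line uses the same column names as the listing in this section.

```python
import csv
import io


def read_call_info(handle):
    """Read a tab-delimited table with a header line into a list of dicts."""
    # QUOTE_NONE: treat any quote characters in the file as ordinary text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# Made-up sample rows for illustration only; NOT actual corpus data.
# The header names are assumed to match the column listing in this README.
sample = (
    "file_id\tgender\tA_nspk\tB_nspk\tseconds\n"
    "fa_0001\tmixed\t1\t2\t1800\n"
    "fa_0002\tfemale\t1\t1\t900\n"
)

rows = read_call_info(io.StringIO(sample))

# Values are read as strings, so convert before doing arithmetic.
total_seconds = sum(float(row["seconds"]) for row in rows)
print(len(rows), total_seconds)
```

To read the actual table, the same function could be given a file handle instead, for example one obtained from open("docs/call_info.tab", newline="").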
The columns are as follows:

  1  file_id  (fa_####)
  2  gender   (male, female or mixed)
  3  A_nspk   (number of speakers on channel A)
  4  B_nspk   (number of speakers on channel B)
  5  seconds  (duration of recorded call, in seconds)

The number of speakers per channel represents how many distinct speaker labels were assigned to speech segments when the call was transcribed. When all speakers appearing in the transcript were of the same gender, the "gender" column in this table shows either "male" or "female"; otherwise, this column shows "mixed".

3.2 audio files (fa_*.flac)

Each audio file is a FLAC-compressed MS-WAV (RIFF) format audio file containing 2-channel, 8-kHz, 16-bit PCM sample data.

The conversion from the original 8-bit mu-law samples to 16-bit PCM has been done in a way that preserves the original mu-law distinction between "positive zero" (0xff samples) and "negative zero" (0x7f samples): in particular, the "negative zero" samples are rendered by a PCM value of -1, instead of 0 (whereas all the common ulaw-to-pcm conversion methods, e.g. SoX, convert both mu-law zeros to PCM 0). This makes it possible to recreate the exact, original mu-law streams from the PCM sample data. (For all but a few research concerns, this distinction is negligible.)

4. Known Issues

The speaker counts in call_info.tab are based on the speaker labels in the corresponding transcript files (released in LDC2014T01). The transcript for one of the calls, fa_7003, was incomplete: only a portion of one side of the call was transcribed. Since that transcript file contains no speaker labels for channel B, the "B_nspk" column in call_info.tab shows "0" for fa_7003, even though there is certainly at least one speaker on channel B of that call.

------------------
README file created by David Graff, July 29, 2013.
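Note on section 3.2: the zero-preserving sample conversion can be illustrated with the Python sketch below. This is not the LDC's conversion code. It assumes that every code other than "negative zero" follows the standard G.711 mu-law expansion, which this README does not state explicitly, and the function names mulaw_to_pcm and pcm_to_mulaw are hypothetical. The sketch works on single integer sample values; decoding the FLAC files themselves is outside its scope.

```python
BIAS = 0x84   # standard G.711 mu-law bias (132)
CLIP = 32635  # largest magnitude representable before adding the bias


def mulaw_to_pcm(code):
    """Expand one 8-bit mu-law code to 16-bit PCM, keeping the two zeros apart."""
    if code == 0x7F:
        return -1  # "negative zero" rendered as -1, as described in section 3.2
    inverted = ~code & 0xFF  # mu-law bytes are stored bit-inverted
    sign = inverted & 0x80
    exponent = (inverted >> 4) & 0x07
    mantissa = inverted & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def pcm_to_mulaw(sample):
    """Compress one 16-bit PCM sample back to the 8-bit mu-law code."""
    if sample == -1:
        return 0x7F  # undo the special rendering of "negative zero"
    sign = 0x80 if sample < 0 else 0x00
    biased = min(abs(sample), CLIP) + BIAS
    exponent = biased.bit_length() - 8  # 0 for 128..255, up to 7 for 16384..32767
    mantissa = (biased >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


# Round trip over all 256 mu-law codes: each code should come back unchanged.
round_trip_ok = all(pcm_to_mulaw(mulaw_to_pcm(c)) == c for c in range(256))
print(round_trip_ok)
```

Under the assumed expansion, the smallest non-zero magnitude is 8, so the PCM value -1 cannot collide with any other mu-law code. That is why reserving -1 for "negative zero" is enough to make the conversion reversible.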