README FILE FOR:  LDC2024S01
CORPUS TITLE:     KASET - Kurmanji and Sorani Kurdish Speech and Transcripts
AUTHORS:          Dana Delgado, Kevin Walker, Stephanie Strassel,
                  David Graff, Christopher Caruso


1.0 Introduction

This package contains the Kurmanji and Sorani Speech Transcripts
(KASET) Corpus, which comprises approximately 147 hours of Kurdish
conversational telephone speech (CTS) and broadcast news (BN)
recordings in two Kurdish dialects: Kurmanji Kurdish and Sorani
Kurdish. Approximately 60 hours of the collected recordings have been
orthographically transcribed, yielding over 500,000 words.

The KASET corpus was created by LDC to support speech technology
research, development and evaluation for Kurdish. Native speakers of
Kurmanji and Sorani Kurdish residing in the United States were
recruited to make phone calls to multiple friends and family members,
speaking for up to 10 minutes on a topic of their choosing. To
supplement the CTS data, LDC collected additional streaming web
broadcasts in Kurdish.

All collected recordings were manually audited to confirm language,
dialect and quality. A portion of the collected data was selected for
verbatim orthographic transcription, following guidelines developed by
LDC for this effort.

The language codes used in directory names, file names and
documentation tables are as follows:

  ckb   Sorani Kurdish (Central Kurdish)
  kmr   Kurmanji Kurdish (Northern Kurdish)


2.0 Directory Structure

The directory structure and contents of the package are summarized
below; paths shown are relative to the base (root) directory of the
package:

  ./docs/ -- contains this README, various tables and lists (see
             section 6 below), and PDF files detailing annotation
             guidelines:

      KASET_Kurmanji_TranscriptionGuidelines_v1.0.pdf
      KASET_Sorani_TranscriptionGuidelines_v1.0.pdf

  ./data/
      audio/
          broadcast/
              ckb/  -- 497 *.flac files
              kmr/  -- 13 *.flac files
          telephone/
              ckb/  -- 29 *.flac files
              kmr/  -- 260 *.flac files
      transcripts/
          broadcast/
              ckb/  -- 497 *.tsv files
              kmr/  -- 13 *.tsv files
          telephone/
              ckb/  -- 18 *.tsv files
              kmr/  -- 54 *.tsv files


3.0 Collection Protocol

3.1 Conversational Telephone Speech (CTS)

Native speakers of each variety residing in the continental US were
recruited and enrolled as human subjects, providing informed consent
and receiving compensation for their effort under a protocol approved
by the University of Pennsylvania's Institutional Review Boards (IRB).

Recruited subjects, known as callers, were required to make a minimum
of 10 calls to different friends and family members residing in North
America, with calls lasting up to 10 minutes. Both callers and callees
provided consent prior to each recording. Callers provided basic
demographic information and were assigned a unique, persistent PIN
upon enrollment. Callees did not provide demographic information and
were not assigned a PIN. Both callers and callees were self-reported
native or highly fluent speakers of the target variety and were
required to use that dialect for the duration of the call. Callers
were permitted to make calls to the same callee up to 3 times.

Calls were collected via LDC's robot-operator platform located in
Philadelphia, which includes a SIP trunk with 18 voice channels
connecting to the public telephone network. Recruited callers dialed
into the platform and entered their unique PIN for verification, then
used the telephone keypad to enter the callee's phone number. The
platform dialed out to the callee and both speakers provided consent
to be recorded. The call was then bridged and recording began.
Recording automatically terminated after 10 minutes. Audio was
captured as two-channel 8-bit μ-law with an 8-kHz sample rate. The
collection platform also captured call metadata.

3.2 Broadcast News (BN)

To supplement the CTS collection, we collected multiple streaming
radio and television broadcast programs identified by native speakers
as containing the target variety. Data included both narrowband and
wideband audio, and many programs contained a mix of Sorani and
Kurmanji Kurdish. Audio was captured in its original encoding (aac or
mp3) as a single-channel audio file with a 16-kHz sample rate. The
collection platform also captured metadata about the recording.


4.0 Auditing and Transcription

4.1 Auditing and Selection of Data for Transcription

Native speaker auditors reviewed all collected data to confirm that it
met language and quality requirements. A portion of the collected,
audited data was selected for transcription.

For BN data, auditors manually identified the best 5-10 minute span
from each recording for transcription. The selected spans were
required to have speaker diversity but little or no overlapping
speech, and were required to be entirely in the target variety (either
all Sorani or all Kurmanji).

For CTS data, auditors listened to portions of each call, confirming
the quality and dialect and indicating speaker sex.

4.2 Transcription

Full CTS recordings and the selected BN spans that passed audit were
transcribed.
Verbatim orthographic transcripts were created by native speakers
following careful transcription guidelines developed by LDC for this
effort. Transcribers first created virtual segments in the audio by
timestamping structural boundaries consisting of sentence-type units
(SUs) or breath/pause groups. Transcribers then produced a verbatim
transcript for each segment, using special conventions to flag certain
speech phenomena such as disfluencies, foreign words or transcriber
uncertainty.

Transcribers were instructed to follow standard writing conventions,
including standard word segmentation and word spelling, using Arabic
script for Sorani and Latin script for Kurmanji. While the audio
recordings contain a variety of Sorani and Kurmanji dialects,
transcribers were required to use standard spelling conventions rather
than trying to mimic the dialect pronunciation. Where multiple
acceptable spellings for a word exist, transcribers followed Boltani
dialect conventions for Kurmanji and Sulaymaniyah, Iraq dialect
conventions for Sorani.

Initial transcripts were reviewed and revised as needed in subsequent
passes, including additional passes focused on spelling normalization.


5.0 Content Summary

Conversational telephone speech audio is presented as two-channel
audio with an 8-kHz sample rate, while broadcasts are presented as
one-channel recordings with a 16-kHz sample rate. All audio is stored
in flac-compressed format.

This release contains full duration CTS recordings, while the
broadcast data includes full duration recordings for some sources, and
recordings of just the selected transcription spans for other sources.

All transcript data is presented as tab-delimited, UTF-8 encoded
tables, with four columns per row:

  1. start offset (seconds)
  2. end offset (seconds)
  3. speaker label
  4. transcript text

The table below summarizes the data volume included in this release.
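As an illustration of the four-column transcript layout listed above,
here is a minimal Python sketch for reading one transcript file. The
names Segment, read_segments, read_transcript and transcribed_seconds
are hypothetical and are not part of this release, and the sample rows
in the usage block are invented. This README does not say whether the
.tsv files carry a header row, so the sketch assumes that any row whose
first two fields are not numeric can be skipped.

```python
from typing import Iterable, List, NamedTuple


class Segment(NamedTuple):
    start: float   # start offset (seconds)
    end: float     # end offset (seconds)
    speaker: str   # speaker label
    text: str      # transcript text


def read_segments(lines: Iterable[str]) -> List[Segment]:
    """Parse tab-delimited transcript rows into Segment tuples."""
    segments = []
    for line in lines:
        # Split on the first three tabs only, so the text field is kept whole.
        fields = line.rstrip("\r\n").split("\t", 3)
        if len(fields) != 4:
            continue  # assumption: blank or short rows carry no segment
        try:
            start, end = float(fields[0]), float(fields[1])
        except ValueError:
            continue  # assumption: non-numeric offsets mean a header row
        segments.append(Segment(start, end, fields[2], fields[3]))
    return segments


def read_transcript(path: str) -> List[Segment]:
    """Read one UTF-8 encoded transcript file from disk."""
    with open(path, encoding="utf-8", newline="") as handle:
        return read_segments(handle)


def transcribed_seconds(segments: Iterable[Segment]) -> float:
    """Sum the durations of all segments."""
    return sum(seg.end - seg.start for seg in segments)


if __name__ == "__main__":
    # Invented rows, for illustration only.
    sample = ["0.50\t2.75\tspk1\tfoo bar\n", "3.00\t4.25\tspk2\tbaz\n"]
    segs = read_segments(sample)
    print(len(segs), transcribed_seconds(segs))
```

Summing segment durations this way should correspond to the "segdur"
column of transcript_info.tab (section 6.4), though that correspondence
has not been checked against the release.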
In the table, "audios" is the number of recorded files, "xscrpts" is
the number of transcript files, "aud_hours" is the summed duration of
the audio files, and "trn_hours" is the summed duration of transcribed
audio. The "tokens" field reflects the number of space-separated
tokens containing at least one alphabetic character, excluding any
non-speech tokens and markup in the transcripts.

  lang  genre  audios  xscrpts  aud_hours  trn_hours  tokens
  ckb   BN        497      497       91.7      50.43  393749
  ckb   CTS        29       18        4.9       3      19267
  ckb   total     526      515       96.6      53.43  413016
  kmr   BN         13       13        7.2       2.14   17803
  kmr   CTS       260       54       43.7       9.25   86873
  kmr   total     273       67       50.9      11.39  104676
  all   total     799      582      147.5      64.82  517692


6.0 Documentation Summary

In addition to this README and the transcription guidelines for each
variety, the docs/ directory contains the tables described in the
following subsections. For each table, we list the column headings
with their descriptions and/or an example value.

6.1 bn_audit.tab -- one row per broadcast audio file:

  1 file_id
  2 path              - directory under data/audio/
  3 b_offs            - seconds offset from start of file to start of
                        transcript
  4 e_offs            - seconds offset from start of file to end of
                        transcript
  5 duration          - seconds
  6 source            - e.g. SCOLA TRT6
  7 audited_languages - e.g. Sorani
  8 program_name      - may be blank (as provided by auditors)

6.2 cts_audit.tab -- one row per telephone audio file:

  1 file_id
  2 duration              - seconds
  3 auditor_language      - the intended language of the call
                            (Kurmanji or Sorani)
  4 language_code         - language code for the intended language of
                            the call (kmr or ckb)
  5 audited_language      - verified language of call (Kurmanji or
                            Sorani)
  6 callee_speaker_gender - male or female

6.3 subjects.tab -- one row per recruited telephone-call subject:

  1 PIN                    - e.g. 0288
  2 sex                    - male or female
  3 YOB                    - e.g. 1997
  4 self_reported_language - e.g. Kurmanji Kurdish

6.4 transcript_info.tab -- one row per transcript file:

  1 file_id
  2 path   - directory under data/transcripts/
  3 b_offs - initial time stamp (seconds) relative to start of audio
  4 e_offs - final time stamp (seconds) relative to start of audio
  5 span   - elapsed time between b_offs & e_offs
  6 segdur - sum of transcript segment durations
  7 nsegs  - number of transcribed segments
  8 ntkns  - number of word tokens
  9 nspkrs - number of distinct speakers

6.5 untranscribed_files.txt -- list of audio files lacking transcripts


7.0 Known Issues

7.1 Zero width non-joiner characters in many (but not all) Sorani
    transcripts

There are 396 instances of the character U+200C "ZERO WIDTH
NON-JOINER" (ZWNJ) in 108 of the Sorani transcripts. This is a special
character whose function is to override the "default behavior" of
Arabic text rendering rules for ligatures. (Unicode places it in the
"General Punctuation" block, although its general category is
"Format".)

Among the 284 distinct word forms containing ZWNJ, there are 107 cases
where the overall transcript inventory also contains a word form that
differs only in lacking the ZWNJ character(s). No attempt has been
made to establish the relative "correctness" of using vs. omitting
ZWNJ characters in the affected forms -- the forms are presented here
as produced by the transcribers.


8.0 Copyright Information

Portions © 2020 GKSAT, © 2020 Kurdistan24, © 2020 Rudaw, © 2020 SAHAR
Universal Network, © 2020 Speda HD, © 2014 Star TV, © 2014, 2020 TRT
Kurdi, © 2023 Trustees of the University of Pennsylvania


9.0 Contacts

If you have questions about this data release, please contact the
following personnel at LDC.

  Stephanie Strassel - PI
  Dana Delgado       - Project Manager
  David Graff        - Technical Lead

----------------------
README created by David Graff on May 10, 2023
       updated by Dana Delgado on May 19, 2023
       updated by Stephanie Strassel on July 10, 2023
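
Note on section 7.1: users who need to decide how to treat ZWNJ may
want to list the word forms that occur both with and without the
character. The following is a minimal Python sketch of one way to do
that, not the procedure used to produce the counts reported above. The
function name zwnj_variants is hypothetical, and the sample words are
invented Latin-script placeholders rather than corpus data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

ZWNJ = "\u200c"  # U+200C ZERO WIDTH NON-JOINER


def zwnj_variants(words: Iterable[str]) -> Dict[str, Set[str]]:
    """Group word forms that differ only by the presence of ZWNJ.

    Returns a mapping from each ZWNJ-free form to the set of attested
    forms that reduce to it, keeping only groups in which the ZWNJ-free
    form itself is attested alongside at least one form with ZWNJ.
    """
    groups: Dict[str, Set[str]] = defaultdict(set)
    for word in words:
        groups[word.replace(ZWNJ, "")].add(word)
    return {
        stripped: forms
        for stripped, forms in groups.items()
        if stripped in forms and len(forms) > 1
    }


if __name__ == "__main__":
    # Invented placeholder forms, for illustration only.
    sample = ["ab\u200ccd", "abcd", "ef\u200cgh", "xyz"]
    for stripped, forms in zwnj_variants(sample).items():
        print(stripped, sorted(repr(f) for f in forms))
```

Whether to strip ZWNJ, keep it, or map both variants to a single form
is left to the user; as stated in section 7.1, the release makes no
claim about which spelling is correct.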