README FILE FOR: LDC2024S01
CORPUS TITLE:    KASET - Kurmanji and Sorani Kurdish Speech and Transcripts
AUTHORS: Dana Delgado, Kevin Walker, Stephanie Strassel, David Graff,
Christopher Caruso

1.0 Introduction

This package contains the Kurmanji and Sorani Kurdish Speech and
Transcripts (KASET) corpus, which comprises approximately 147 hours of
Kurdish conversational telephone speech (CTS) and broadcast news (BN)
recordings in two Kurdish
dialects: Kurmanji Kurdish and Sorani Kurdish. Approximately 65 hours of
the collected recordings have been orthographically transcribed, yielding
over 500,000 words.

The KASET corpus was created by LDC to support speech technology research,
development and evaluation for Kurdish. Native speakers of Kurmanji and
Sorani Kurdish residing in the United States were recruited to make phone
calls to multiple friends and family members, speaking for up to 10 minutes
on a topic of their choosing. To supplement the CTS data, LDC collected
additional streaming web broadcasts in Kurdish. All collected recordings
were manually audited to confirm language, dialect and quality. A portion
of the collected data was selected for verbatim orthographic transcription,
following guidelines developed by LDC for this effort. The language codes
used in directory names, file names and documentation tables are as
follows:

  ckb	Sorani Kurdish (Central Kurdish)
  kmr	Kurmanji Kurdish (Northern Kurdish)

2.0 Directory Structure

The directory structure and contents of the package are summarized below;
paths shown are relative to the base (root) directory of the package:

  ./docs/
     -- contains this README, various tables and lists (see section 6
        below), and PDF files detailing annotation guidelines:
           KASET_Kurmanji_TranscriptionGuidelines_v1.0.pdf
           KASET_Sorani_TranscriptionGuidelines_v1.0.pdf

  ./data/
     audio/
        broadcast/
           ckb/  -- 497 *.flac files
           kmr/  --  13 *.flac files
        telephone/
           ckb/  --  29 *.flac files
           kmr/  -- 260 *.flac files
     transcripts/
        broadcast/
           ckb/  -- 497 *.tsv files
           kmr/  --  13 *.tsv files
        telephone/
           ckb/  --  18 *.tsv files
           kmr/  --  54 *.tsv files
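
The layout above suggests that each transcript sits at the same
genre/language path as its audio file. The Python sketch below pairs the
two trees on that basis. It assumes that a transcript shares the file stem
of its audio file, which this README does not state explicitly; the
function name is purely illustrative.

```python
from pathlib import Path


def pair_audio_with_transcripts(root):
    """Map each audio file to its transcript path, or None if there is none.

    Assumption (not confirmed by this README): a transcript has the same
    stem and the same genre/language subdirectory as its audio file.
    """
    root = Path(root)
    audio_dir = root / "data" / "audio"
    trans_dir = root / "data" / "transcripts"
    pairs = {}
    # Audio files live two levels down: <genre>/<language>/<file>.flac
    for flac in sorted(audio_dir.glob("*/*/*.flac")):
        rel = flac.relative_to(audio_dir).with_suffix(".tsv")
        tsv = trans_dir / rel
        pairs[flac] = tsv if tsv.is_file() else None
    return pairs
```

Audio files that map to None here would be the ones expected to lack a
transcript, if the naming assumption holds.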

3.0 Collection Protocol

3.1 Conversational Telephone Speech (CTS)

Native speakers of each variety residing in the continental US were
recruited and enrolled as human subjects, providing informed consent and
receiving compensation for their effort under a protocol approved by the
University of Pennsylvania's Institutional Review Board (IRB). Recruited
subjects, known as callers, were required to make a minimum of 10 calls to
different friends and family members residing in North America, with calls
lasting up to 10 minutes. Both callers and callees provided consent prior
to each recording. Callers provided basic demographic information and were
assigned a unique, persistent PIN upon enrollment. Callees did not provide
demographic information and were not assigned a PIN. Both callers and
callees were self-reported native or highly fluent speakers of the target
variety and were required to use that dialect for the duration of the
call. Callers were permitted to call the same callee up to 3 times.

Calls were collected via LDC's robot-operator platform located in
Philadelphia, containing a SIP trunk with 18 voice channels connecting to
the public telephone network. Recruited callers dialed into the platform
and entered their unique PIN for verification, then used the telephone
keypad to enter the callee's phone number. The platform dialed out to the
callee and both speakers provided consent to be recorded. The call was then
bridged and recording began. Recording automatically terminated after 10
minutes. Audio was captured as two-channel 8-bit μ-law with an 8-kHz sample
rate. The collection platform also captured call metadata.

3.2 Broadcast News (BN)

To supplement the CTS collection, we collected multiple streaming radio and
television broadcast programs identified by native speakers as containing
the target variety. Data included both narrowband and wideband audio and
many programs contained a mix of Sorani and Kurmanji Kurdish. Audio was
captured in its original encoding (AAC or MP3) as a single-channel audio
file with a 16-kHz sample rate. The collection platform also captured
metadata about the recording.

4.0 Auditing and Transcription

4.1 Auditing and Selection of Data for Transcription

Native speaker auditors reviewed all collected data to confirm that it met
language and quality requirements. A portion of the collected, audited data
was selected for transcription. For BN data, auditors manually identified
the best 5-10 minute span from each recording for transcription. The
selected spans were required to have speaker diversity but little or no
overlapping speech, and were required to be entirely in the target variety
(either all Sorani or all Kurmanji). For CTS data, auditors listened to
portions of each call, confirming the quality and dialect and indicating
speaker sex.

4.2 Transcription

Full CTS recordings and the selected BN spans that passed audit were
transcribed. Verbatim orthographic transcripts were created by native
speakers following careful transcription guidelines developed by LDC for
this effort. Transcribers first created virtual segments in the audio by
timestamping structural boundaries consisting of sentence-type units (SUs)
or breath/pause groups. Transcribers then produced a verbatim transcript
for each segment, using special conventions to flag certain speech
phenomena like disfluencies, foreign words or transcriber
uncertainty. Transcribers were instructed to follow standard writing
conventions, including standard word segmentation and word spelling, using
Arabic script for Sorani and Latin script for Kurmanji. While the audio
recordings contain a variety of Sorani and Kurmanji dialects, transcribers
were required to use standard spelling conventions rather than trying to
mimic the dialect pronunciation. Where multiple acceptable spellings for a
word exist, transcribers followed Boltani dialect conventions for Kurmanji
and Sulaymaniyah, Iraq dialect conventions for Sorani. Initial transcripts
were reviewed and revised as needed in subsequent passes, including
additional passes focused on spelling normalization.

5.0 Content Summary

Conversational telephone speech audio is presented as two-channel audio
with an 8-kHz sample rate, while broadcasts are presented as one-channel
recordings with a 16-kHz sample rate. All audio is stored in
FLAC-compressed format. This release contains full-duration CTS recordings,
while the broadcast data includes full-duration recordings for some
sources and recordings of just the selected transcription spans for other
sources.
 
All transcript data is presented as tab-delimited, UTF-8 encoded tables,
with four columns per row:

  1. start offset (seconds)
  2. end offset (seconds)
  3. speaker label
  4. transcript text
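
As an illustration, the four-column layout above can be read with Python's
standard csv module. The sketch below makes two assumptions that this
README does not confirm: that the transcript text contains no tab
characters, and that any row whose offsets are not numeric (for example a
header row, if one exists) can be skipped. The Segment and read_segments
names are illustrative only.

```python
import csv
from typing import NamedTuple


class Segment(NamedTuple):
    start: float   # start offset in seconds
    end: float     # end offset in seconds
    speaker: str   # speaker label
    text: str      # transcript text


def read_segments(lines):
    """Parse tab-delimited transcript rows into Segment tuples."""
    segments = []
    # QUOTE_NONE keeps quotation marks in the text from being read as CSV quoting.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # blank or malformed row
        try:
            start, end = float(row[0]), float(row[1])
        except ValueError:
            continue  # non-numeric offsets, e.g. a header row if present
        segments.append(Segment(start, end, row[2], row[3]))
    return segments
```

A file would typically be opened with encoding="utf-8" and newline="" and
the open file object passed directly to read_segments().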

The table below summarizes the data volume included in this release. In the
table, "audios" is the number of audio files, "xscrpts" is the number of
transcript files, "aud_hours" is the summed duration of the audio files and
"trn_hours" is the summed duration of transcribed audio. The "tokens" field
reflects the number of space-separated tokens containing at least one
alphabetic character, excluding any non-speech tokens and markup in the
transcripts.
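
The core of the token definition above can be sketched in a few lines of
Python. This sketch covers only the "at least one alphabetic character"
test; it does not remove non-speech tokens or markup, because those
conventions are defined in the transcription guidelines rather than in
this README, so its counts should not be expected to match the table
exactly.

```python
def count_tokens(text):
    """Count whitespace-separated tokens containing at least one alphabetic character.

    Note: markup and non-speech tokens are NOT filtered here; a marked-up
    token that contains letters would still be counted.
    """
    return sum(1 for tok in text.split() if any(ch.isalpha() for ch in tok))
```

Digits-only and punctuation-only tokens are skipped, since str.isalpha()
is false for every character in them.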

lang  genre  audios  xscrpts  aud_hours  trn_hours  tokens
ckb   BN        497      497       91.7      50.43  393749
ckb   CTS        29       18        4.9       3.00   19267
ckb   total     526      515       96.6      53.43  413016
kmr   BN         13       13        7.2       2.14   17803
kmr   CTS       260       54       43.7       9.25   86873
kmr   total     273       67       50.9      11.39  104676
all   total     799      582      147.5      64.82  517692

6.0 Documentation Summary

In addition to this README and the transcription guidelines for each
variety, the docs/ directory contains the tables described in the following
subsections.  For each table, we list the column headings with their
descriptions and/or an example value.

6.1 bn_audit.tab -- one row per broadcast audio file:
    1 file_id
    2 path - directory under data/audio/
    3 b_offs - seconds offset from start of file to start of transcript
    4 e_offs - seconds offset from start of file to end of transcript
    5 duration - seconds
    6 source - e.g. SCOLA TRT6
    7 audited_languages - e.g. Sorani
    8 program_name - may be blank (as provided by auditors)
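
As one possible use of this table, the Python sketch below reports how
much of each broadcast file falls inside the transcribed span, computed
as (e_offs - b_offs) / duration. It assumes that the file is
tab-delimited with a header row that uses the column names listed above;
the function name is illustrative.

```python
import csv


def span_coverage(rows):
    """Yield (file_id, fraction of the file's duration inside b_offs..e_offs).

    `rows` is an iterable of dicts keyed by the column names listed above,
    e.g. a csv.DictReader over bn_audit.tab (header row assumed).
    """
    for row in rows:
        duration = float(row["duration"])
        span = float(row["e_offs"]) - float(row["b_offs"])
        yield row["file_id"], (span / duration if duration > 0 else 0.0)
```

For example, rows could come from csv.DictReader(f, delimiter="\t") over
an open docs/bn_audit.tab file. A value near 1.0 would indicate a
recording that consists of just the selected transcription span.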

6.2 cts_audit.tab -- one row per telephone audio file:
    1 file_id
    2 duration - seconds
    3 auditor_language - the intended language of the call (Kurmanji or
      Sorani)
    4 language_code - language code for the intended language of the call
      (kmr or ckb)
    5 audited_language - verified language of call (Kurmanji or Sorani)
    6 callee_speaker_gender - male or female

6.3 subjects.tab -- one row per recruited telephone-call subject:
    1 PIN - e.g. 0288
    2 sex - male or female
    3 YOB - e.g. 1997
    4 self_reported_language - e.g. Kurmanji Kurdish

6.4 transcript_info.tab -- one row per transcript file:
    1 file_id
    2 path - directory under data/transcripts/
    3 b_offs - initial time stamp (seconds) relative to start of audio
    4 e_offs - final time stamp (seconds) relative to start of audio
    5 span - elapsed time between b_offs & e_offs
    6 segdur - sum of transcript segment durations
    7 nsegs - number of transcribed segments
    8 ntkns - number of word tokens
    9 nspkrs - number of distinct speakers
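
To make the relationship between these columns concrete, the Python
sketch below derives several of them from a list of (start, end, speaker,
text) segments. It is an interpretation of the column descriptions above,
not the script used to build the table; in particular, it takes b_offs as
the earliest start and e_offs as the latest end.

```python
def summarize_segments(segments):
    """Derive b_offs, e_offs, span, segdur, nsegs and nspkrs from segments.

    Each segment is a (start, end, speaker, text) tuple with offsets in seconds.
    """
    if not segments:
        raise ValueError("no segments to summarize")
    b_offs = min(seg[0] for seg in segments)
    e_offs = max(seg[1] for seg in segments)
    return {
        "b_offs": b_offs,
        "e_offs": e_offs,
        "span": e_offs - b_offs,                             # elapsed time
        "segdur": sum(seg[1] - seg[0] for seg in segments),  # summed durations
        "nsegs": len(segments),
        "nspkrs": len({seg[2] for seg in segments}),         # distinct labels
    }
```

Under this reading, segdur can be smaller than span when there are
untranscribed gaps between segments, and larger when segments from
different speakers overlap.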

6.5 untranscribed_files.txt -- list of audio files lacking transcripts

7.0 Known Issues

7.1 Zero width non-joiner characters in many (but not all) Sorani
transcripts

There are 396 instances of the character U+200C "ZERO WIDTH NON-JOINER"
(ZWNJ) in 108 of the Sorani transcripts.  This is an invisible character
that overrides the default behavior of Arabic-script rendering by
preventing the letters on either side of it from joining.  (Unicode
assigns it the general category Cf, "Format", so it is neither a letter
nor whitespace.)

It may be worth noting that among the 284 distinct word forms containing
ZWNJ, we find 107 cases where the overall transcript inventory also
contains a word form that differs only in lacking the ZWNJ character(s).
No attempt has been made to establish the relative "correctness" of using
vs. omitting ZWNJ characters in the affected forms -- the forms are
presented here as produced by the transcribers.
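
Users who want to inspect these cases can do so with a few lines of
Python. The sketch below takes any collection of word forms and returns
those that contain ZWNJ and whose ZWNJ-free counterpart also occurs in
the collection. How the word inventory is built (tokenization, markup
handling) is left open, and the function name is illustrative.

```python
ZWNJ = "\u200c"  # U+200C ZERO WIDTH NON-JOINER


def zwnj_variants(word_forms):
    """Return ZWNJ-bearing forms whose ZWNJ-stripped form is also in the inventory."""
    inventory = set(word_forms)
    return sorted(
        w for w in inventory
        if ZWNJ in w and w.replace(ZWNJ, "") in inventory
    )
```

Whether to normalize such pairs, and in which direction, is a decision
for the user; w.replace(ZWNJ, "") removes the character if that is
wanted.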

8.0 Copyright Information

Portions © 2020 GKSAT, © 2020 Kurdistan24, © 2020 Rudaw, © 2020 SAHAR
Universal Network, © 2020 Speda HD, © 2014 Star TV, © 2014, 2020 TRT Kurdi,
© 2023 Trustees of the University of Pennsylvania

9.0 Contacts

If you have questions about this data release, please contact the
following personnel at LDC.

Stephanie Strassel - PI <strassel@ldc.upenn.edu>
Dana Delgado - Project Manager <foredana@ldc.upenn.edu>
David Graff - Technical Lead <graff@ldc.upenn.edu>

----------------------
README created by David Graff on May 10, 2023
       updated by Dana Delgado on May 19, 2023
       updated by Stephanie Strassel on July 10, 2023