README FILE FOR LDC CATALOG ID: LDC2023S01

TITLE: AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts

AUTHORS: Dana Delgado, Kevin Walker, David Graff, Stephanie Strassel

1. Introduction

This package contains the AIDA Ukrainian Broadcast and Telephone Speech
Audio and Transcripts Corpus, which comprises approximately 156 hours of
Ukrainian conversational telephone speech (CTS) and broadcast news (BN)
with 1.2M words of corresponding orthographic transcripts. The BN
recordings in this corpus were collected to support the DARPA AIDA
program, and the CTS recordings were collected to support the NIST 2011
Language Recognition Evaluation (LRE). The transcripts in this package
were produced to support the DARPA AIDA program.

The data in this corpus was originally released to AIDA program
performers as follows:

  LDC2018E73 AIDA Ukrainian Broadcast and Telephone Speech Transcripts V2.0
  LDC2018E74 AIDA Ukrainian Broadcast and Telephone Speech Audio V1.0

The CTS audio in this corpus also appears in this LDC catalog publication:

  LDC2016S11 Multi-Language Conversational Telephone Speech 2011 -- Slavic Group

2. Directory Structure and Content Summary

The directory structure and contents of the package are summarized
below -- paths shown are relative to the base (root) directory of the
package:

  data/
    audio/        -- contains audio files
    transcripts/  -- contains transcript files
  docs/           -- contains documentation about the files and
                     transcription guidelines

2.1 Audio Content Summary

All audio files are in .flac format.

For files with names like: XXXXXX_XXXXXX_radiovesti_ukr, streaming audio
was collected from:

  Radio Vesti:

More information on the audio files can be found in docs/.

Audio content summary:

+------------------+----------------+------------------+
| genre            | files / segs   | duration (hours) |
+------------------+----------------+------------------+
| broadcast news   | 289 / 505      | 137.0            |
+------------------+----------------+------------------+
| telephone speech | 89 / 89        | 19.0             |
+------------------+----------------+------------------+
| total            | 378 / 594      | 156.0            |
+------------------+----------------+------------------+

Summary by source:

+----------------------+----------------+------------------+
| source               | files / segs   | duration (hours) |
+----------------------+----------------+------------------+
| VOA                  | 146 / 146      | 56.2             |
+----------------------+----------------+------------------+
| LDC CTS              | 89 / 89        | 19.0             |
+----------------------+----------------+------------------+
| Radio Vesti          | 6 / 17         | 5.3              |
+----------------------+----------------+------------------+
| Radio of Ukraine     |                |                  |
| (nrcu)               | 53 / 195       | 43.2             |
+----------------------+----------------+------------------+
| RFE/RL               |                |                  |
| (rfe,radiosvoboda,   |                |                  |
| ukra)                | 16 / 37        | 11.7             |
+----------------------+----------------+------------------+
| LiveOnlineRadio.Net  |                |                  |
| (golosstolytsi)      | 5 / 10         | 3.8              |
+----------------------+----------------+------------------+
| Hromadske Radio      | 13 / 25        | 9.2              |
+----------------------+----------------+------------------+
| Radio Era            | 50 / 75        | 7.6              |
+----------------------+----------------+------------------+
| total                | 378 / 594      | 156.0            |
+----------------------+----------------+------------------+

2.2 Data Preparation

Native Ukrainian speakers manually segmented the data into
sentence-level units as part of the transcription process. For broadcast
audio, some files cover more than one report or "story". In these files,
transcribers marked the start of the region where each new story begins.
Transcribers also marked untranscribed regions of advertisements or
music.

After transcription, native speaker annotators manually reviewed
broadcast files with extended regions of advertisements or music to
confirm that these sections should be automatically removed. Segmented
versions of 87 broadcast news recordings and their corresponding
transcripts were then created to remove these regions. The edited files
appear in the data directory with a two-digit segment number appended to
the file-ID, e.g.:

  20170508_144101_nrcu_UR1_ukr_01.flac
  20170508_144101_nrcu_UR1_ukr_02.flac
  ...

Note that in six of these cases, the editing involved removing just one
segment at the beginning or end of the original recording, so there is
only one audio file present (with "_01" appended to the file name).

2.3 Audio File Collection and Processing

All CTS audio files were originally collected as 2-channel u-law and
were converted to 8 kHz 16-bit PCM and FLAC-compressed for release.

All BN audio files were originally collected as MP3, either via web
download or as live streaming broadcast captures, and were downsampled
to either 16 kHz or 22 kHz 16-bit PCM and FLAC-compressed for release.
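
The naming convention for edited files described in section 2.2 can be
handled with a small helper. The following is a minimal sketch, not part
of the corpus tooling: the function names split_segment and
group_by_recording are hypothetical, and the code assumes that only
edited files end in an underscore followed by two digits.

```python
import re

# Assumption: edited files end in "_NN" (two digits) before the extension,
# and unedited file-IDs never end that way.
SEGMENT_SUFFIX = re.compile(r"^(?P<file_id>.+)_(?P<segment>\d{2})$")


def split_segment(name):
    """Return (file_id, segment); segment is None for unedited files."""
    stem = name
    for ext in (".flac", ".tsv"):
        if stem.endswith(ext):
            stem = stem[: -len(ext)]
    match = SEGMENT_SUFFIX.match(stem)
    if match:
        return match.group("file_id"), int(match.group("segment"))
    return stem, None


def group_by_recording(names):
    """Map each original file-ID to the sorted segment numbers present."""
    groups = {}
    for name in names:
        file_id, segment = split_segment(name)
        parts = groups.setdefault(file_id, [])
        if segment is not None:
            parts.append(segment)
    return {file_id: sorted(parts) for file_id, parts in groups.items()}
```

An unedited recording maps to an empty list, so the two cases can be
told apart without inspecting the file names again.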

2.4 Transcript Content Summary

+---------------------+-----------------+-------------+
| source              | files / parts   | words       |
+---------------------+-----------------+-------------+
| VOA                 | 146 / 146       | 382684      |
+---------------------+-----------------+-------------+
| LDC CTS             | 89 / 89         | 197428      |
+---------------------+-----------------+-------------+
| Radio Vesti         | 6 / 17          | 36465       |
+---------------------+-----------------+-------------+
| Radio of Ukraine    |                 |             |
| (nrcu)              | 53 / 195        | 320209      |
+---------------------+-----------------+-------------+
| RFE/RL              |                 |             |
| (rfe,radiosvoboda,  |                 |             |
| ukra)               | 16 / 37         | 86348       |
+---------------------+-----------------+-------------+
| LiveOnlineRadio.Net |                 |             |
| (golosstolytsi)     | 5 / 10          | 27969       |
+---------------------+-----------------+-------------+
| Hromadske Radio     | 13 / 25         | 75181       |
+---------------------+-----------------+-------------+
| Radio Era           | 50 / 75         | 66281       |
+---------------------+-----------------+-------------+
| total               | 378 / 594       | 1,192,565   |
+---------------------+-----------------+-------------+

The number of total files (378) refers to the number of original
recordings, as recorded by the LDC. Some of the broadcast news original
recordings (and their transcripts) were broken up into parts, in order
to exclude commercial advertisements from the release; the number of
parts (594) refers to the total inventory of transcript files. (See
section 2.2 above.)

In a few cases -- 3 files from Radio of Ukraine / nrcu -- after the
original recording was broken up into parts, the initial part was found
to contain no Ukrainian speech data, and so has been left out of the
corpus. (These three sets of "part" files have "_02", "_03" and "_04",
but no "_01".)

All transcripts are delivered as *.tsv tab-delimited files with the
following fields:

  - start timestamp
  - end timestamp
  - speaker ID (speaker1, speaker2, etc.)
  - speaker sex (male, female, unknown)
  - transcript text

These files do not have headers -- the first row of each file is
transcript data -- but not all rows are transcript segments: in many
broadcast news transcript files, there are rows to mark story
boundaries; these rows have no time stamps or speaker information, and
contain just the token "" in column 5 (transcript text).

3. Transcription mark-up

The following mark-up is used in the transcripts to indicate:

  %fp     - filled pause
  %pw     - partial word
  %noise  - noise or speaker noise, such as a cough or sneeze
          - non-Ukrainian speech of more than a few words
  (())    - unclear speech, best-guess transcription
          - the beginning of a news story
          - a segment containing an advertisement and/or music

For more information, see transcription guidelines
./docs/Ukrainian_Broadcast_Transcription_V3.1.pdf

4. Documentation

The following documents are present in the docs/ directory of this
package:

4.1 Four-column tab-delimited table providing information about each
    audio file; the columns are:

  1 filename      -- includes 2-digit segment number for edited files
  2 channel_count -- 1 or 2
  3 sample_rate   -- samples per second
  4 duration_sec

The table has a separate row for each edited/segmented broadcast news
audio file.

4.2 Four-column tab-delimited table providing information about each
    original recording; the columns are:

  1 filename      -- file-ID (minus 2-digit segment number in edited files)
  2 total_seconds -- summed over segments in edited files
  3 genre         -- "bcnews" or "cts"
  4 segmentation  -- "unedited" or list of segment numbers ("01,02...")

In the set of edited "bcnews" entries, the number of segments per
file-ID ranges from 1 (01) to 11. In a few cases, a part number is
followed by an asterisk (e.g. "01*") to indicate that the corresponding
audio "part" file has been excluded from the corpus; this is because
after the splitting was done, one part was found to contain no Ukrainian
speech (see section 2.4 above).
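
The "segmentation" column described in 4.2 can be expanded into the list
of part files that should actually be present in the release. The
following is a minimal sketch under stated assumptions: the function
names parse_segmentation and released_parts are hypothetical, and the
code assumes that segment numbers are comma-separated and that an
asterisk, when present, directly follows the number.

```python
def parse_segmentation(value):
    """Parse "unedited" or a list such as "01*,02,03".

    Returns a list of (segment_number, excluded) pairs; the list is
    empty for unedited recordings.
    """
    value = value.strip()
    if value == "unedited":
        return []
    parts = []
    for token in value.split(","):
        token = token.strip()
        # A trailing asterisk marks a part left out of the corpus.
        excluded = token.endswith("*")
        parts.append((int(token.rstrip("*")), excluded))
    return parts


def released_parts(file_id, value):
    """Return the file-IDs expected in the release for one table row."""
    parts = parse_segmentation(value)
    if not parts:
        # Unedited recordings keep the original file-ID.
        return [file_id]
    return [f"{file_id}_{number:02d}" for number, excluded in parts if not excluded]
```

Comparing the output of released_parts against the contents of the data
directory is one way to check a local copy for missing part files.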

4.3 audio_md5sums.txt

checksum for each audio file

4.4 Six-column table providing summary information for each transcript
    file, as follows:

  1 filename -- file-ID minus the ".tsv" extension
  2 span     -- total audio duration covered by the transcript (seconds)
  3 segsec   -- total sum of speaker segment durations (seconds)
  4 segs     -- total number of speaker segments
  5 tkns     -- total number of word tokens in segments
  6 spkrs    -- total number of distinct speakers

Note that the difference between "span" and "segsec" reflects both gaps
between speaker turns and overlapping speaker turns.

4.5 Ukrainian_Broadcast_Transcription_V3.1.pdf

Guidelines/specifications used in creating the transcripts in this
package. Note that although 'Broadcast' is specified in the title of the
guidelines, for consistency, these guidelines were also used to produce
the CTS transcripts.

5. Known Issues

5.1 Bad time stamps in some broadcast news transcript files

As indicated in the transcription guidelines, annotators were instructed
to mark regions that contained commercial advertisements. For this
release, the "commercial" portions have been edited out from both the
audio and the transcript files for the subset of broadcast news
recordings that contained commercials.

While preparing and carrying out the programmatic removal of
commercials, we observed a particular pattern of error in some of the
transcripts. In regions where two or more speakers in the broadcast were
talking simultaneously, we sometimes found that one of the speaker turns
in the region had an end-time marking that yielded an implausible
duration for the turn -- that is, although a given speaker may have said
only a few words (while others were talking), and the start time for
this utterance was fairly accurate, the end time was set to an offset
position that was dozens or even hundreds of seconds later.
(This offset often correlated with the position of the next turn for the
given speaker, and we sometimes observed that word tokens from that
later utterance were incorrectly included as part of the overlapping
speech segment.)

Among the files that underwent removal of commercials, most or all cases
of this problem were repaired. But the same basic symptom was also found
in some other broadcast news files, and those cases have not yet been
fixed.

Users are advised to be watchful for speech segments in the transcript
files where the beginning and end time stamps indicate an unusually long
turn, but the text content is unexpectedly brief. As mentioned above,
this condition tends to show up in regions where two or more speakers
overlap.

5.2 Missing speaker data in some transcribed segments

There are 17 "voa_ukr" files that each contain a single transcribed
segment that lacks speaker information in columns 3 and 4; in each case,
the utterance is simply the phrase 'В ефірі "Голос Америки"' (the "Voice
of America" intro). Apart from that set, some transcribed speech regions
have empty cells for speaker gender and/or speaker-ID.

While many advertisement regions have been edited out, many regions
remain that are marked with the transcript text value ""; these rows in
the transcripts have time stamps but no speaker information.

5.3 Some transcript files have very little content

As indicated in the ./docs/ file, several transcript files contain very
few transcribed utterances, covering just a short time span in the
recording and having a very small token count.

6. Acknowledgements

This material is based upon work supported by the Defense Advanced
Research Projects Agency (DARPA) under Contract Nos. HR0011-15-C-0123
and FA8750-18-C-0013. Any opinions, findings and conclusions or
recommendations expressed in this material are those of the author(s)
and do not necessarily reflect the views of DARPA.

7. Copyright Information

Portions © 2017 Crimean Radio and Television Company, © 2017-2018
Hromadske Radio, © 2017-2018 LiveOnlineRadio.Net, © 2017-2018 Radio of
Ukraine, © 2017-2018 Radio Vesti, © 2017-2018 RFE/RL, Inc., © 2022
Trustees of the University of Pennsylvania

8. Contact Information

If you have questions about this data release, please contact the
following personnel at LDC.

  Stephanie Strassel   PI
  Dana Delgado         Project Manager
  David Graff          Technical Lead

----
README created by Dana Delgado on March 7, 2022
README updated by Dana Delgado on March 16, 2022
README updated by Stephanie Strassel on March 21, 2022
README updated by Dana Delgado on March 23, 2022
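
Usage note: the transcript layout from section 2.4 and the check
suggested in section 5.1 can be sketched as follows. This is an
illustration under stated assumptions, not LDC tooling: the function
names and the thresholds (30 seconds, 5 tokens) are hypothetical, the
files are assumed to be UTF-8, and time stamps are assumed to be decimal
seconds. Rows without a speaker ID are skipped by the check, since the
remaining advertisement or music regions (section 5.2) carry time stamps
but no speaker information.

```python
import csv


def read_transcript(path):
    """Return (segments, markers) from one headerless 5-column transcript."""
    segments, markers = [], []
    # QUOTE_NONE keeps literal double quotes in the transcript text intact.
    with open(path, encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            if not row:
                continue
            # Pad short rows so unpacking always sees five columns.
            start, end, speaker, sex, text = (row + [""] * 5)[:5]
            if not start.strip() or not end.strip():
                # Rows without time stamps mark story boundaries.
                markers.append(text)
                continue
            segments.append({
                "start": float(start),  # assumption: decimal seconds
                "end": float(end),
                "speaker": speaker.strip(),
                "sex": sex.strip(),
                "text": text,
            })
    return segments, markers


def suspicious_segments(segments, min_seconds=30.0, max_tokens=5):
    """Flag long turns with very little text (thresholds are assumptions)."""
    flagged = []
    for seg in segments:
        if not seg["speaker"]:
            # Likely an advertisement or music region, not a speaker turn.
            continue
        duration = seg["end"] - seg["start"]
        if duration >= min_seconds and len(seg["text"].split()) <= max_tokens:
            flagged.append(seg)
    return flagged
```

The thresholds will need tuning per source; flagged rows are candidates
for manual review, not confirmed errors.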