README FILE FOR LDC CATALOG ID: LDC2023S01

TITLE: AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts

AUTHORS: Dana Delgado, Kevin Walker, David Graff, Stephanie Strassel


1. Introduction

This package contains the AIDA Ukrainian Broadcast and Telephone Speech
Audio and Transcripts Corpus, which comprises approximately 156 hours of
Ukrainian conversational telephone speech (CTS) and broadcast news (BN)
with 1.2M words of corresponding orthographic transcripts. The BN
recordings in this corpus were collected to support the DARPA AIDA program,
and the CTS recordings were collected to support the NIST 2011 Language
Recognition Evaluation (LRE). The transcripts in this package were produced
to support the DARPA AIDA program.

The data in this corpus was originally released to AIDA program performers
as follows:
LDC2018E73 AIDA Ukrainian Broadcast and Telephone Speech Transcripts V2.0
LDC2018E74 AIDA Ukrainian Broadcast and Telephone Speech Audio V1.0

The CTS audio in this corpus also appears in this LDC catalog publication:
LDC2016S11 Multi-Language Conversational Telephone Speech 2011 -- Slavic
Group

2. Directory Structure and Content Summary

The directory structure and contents of the package are summarized below --
paths shown are relative to the base (root) directory of the package:

  data/
    audio/ -- contains audio files
    transcripts/ -- contains transcript files

  docs/ -- contains documentation about the files and transcription
           guidelines

2.1 Audio Content Summary

All audio files are in .flac format. For files with names like
XXXXXX_XXXXXX_radiovesti_ukr, streaming audio was collected from:

Radio Vesti: http://212.26.132.60:8000/vesti_mp3

More information on the audio files can be found in docs/.

Audio content summary:

+------------------+----------------+------------------+
| genre            | files / segs   | duration (hours) |
+------------------+----------------+------------------+
| broadcast news   | 289 / 505      | 137.0            |
+------------------+----------------+------------------+
| telephone speech |  89 /  89      |  19.0            |
+------------------+----------------+------------------+
| total            | 378 / 594      | 156.0            |
+------------------+----------------+------------------+

Summary by source:

+----------------------+----------------+------------------+
| source               | files / segs   | duration (hours) |
+----------------------+----------------+------------------+
| VOA                  | 146 / 146      | 56.2             |
+----------------------+----------------+------------------+
| LDC CTS              | 89  / 89       | 19.0             |
+----------------------+----------------+------------------+
| Radio Vesti          | 6   / 17       | 5.3              |
+----------------------+----------------+------------------+
| Radio of Ukraine     |                |                  |
| (nrcu)               | 53  / 195      | 43.2             |
+----------------------+----------------+------------------+
| RFE/RL               |                |                  |
| (rfe,radiosvoboda,   |                |                  |
|  ukra)               | 16  / 37       | 11.7             |
+----------------------+----------------+------------------+
| LiveOnlineRadio.Net  |                |                  |
| (golosstolytsi)      | 5   / 10       | 3.8              |
+----------------------+----------------+------------------+
| Hromadske Radio      | 13  / 25       | 9.2              |
+----------------------+----------------+------------------+
| Radio Era            | 50  / 75       | 7.6              |
+----------------------+----------------+------------------+
| total                | 378 / 594      | 156.0            |
+----------------------+----------------+------------------+

2.2  Data Preparation

Native Ukrainian speakers manually segmented the data into sentence-level
units as part of the transcription process.

For broadcast audio, some files cover more than one report or "story". In 
these files, transcribers marked the start of the region where each new story
begins. Transcribers also marked untranscribed regions of advertisements or
music.

After transcription, native speaker annotators manually reviewed broadcast files
with extended regions of advertisements or music to confirm that these sections
should be automatically removed. Segmented versions of 87 broadcast news
recordings and their corresponding transcripts were then created to remove
these regions. The edited files appear in the data directory with a two-digit
segment number appended to the file-ID, e.g.:

 20170508_144101_nrcu_UR1_ukr_01.flac
 20170508_144101_nrcu_UR1_ukr_02.flac
 ...

Note that in six of these cases, the editing involved removing just one segment
at the beginning or end of the original recording, so there is only one audio file
present (with "_01" appended to the file name).
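
As an illustration of the naming convention described above, the following
Python sketch splits a file name into its file-ID and optional two-digit
segment number. The function name split_part_name and the regular expression
are illustrative assumptions, not part of the corpus tooling. In particular,
the pattern assumes that no unedited file-ID ends in an underscore followed by
exactly two digits.

```python
import re

# Assumed pattern: <file-ID>, optional "_NN" segment, optional extension.
_PART_NAME = re.compile(
    r"^(?P<file_id>.+?)(?:_(?P<segment>\d{2}))?(?:\.(?:flac|tsv))?$"
)


def split_part_name(name):
    """Return (file_id, segment); segment is None when no "_NN" suffix exists."""
    match = _PART_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized file name: {name!r}")
    return match.group("file_id"), match.group("segment")
```

Grouping the results by file-ID gives the set of parts that belong to one
original recording.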

2.3  Audio File Collection and Processing

All CTS audio files were originally collected as 2-channel u-law and were
converted to 8 kHz 16-bit PCM and FLAC-compressed for release. All BN audio
files were originally collected as MP3, either via web download or as live
streaming broadcast captures, and were downsampled to either 16 kHz or 22 kHz
16-bit PCM and FLAC-compressed for release.


2.4 Transcript Content Summary

+---------------------+-----------------+-------------+
| source              | files / parts   | words       |
+---------------------+-----------------+-------------+
| VOA                 | 146   / 146     | 382684      |
+---------------------+-----------------+-------------+
| LDC CTS             | 89    / 89      | 197428      |
+---------------------+-----------------+-------------+
| Radio Vesti         | 6     / 17      | 36465       |
+---------------------+-----------------+-------------+
| Radio of Ukraine    |                 |             |
| (nrcu)              | 53    / 195     | 320209      |
+---------------------+-----------------+-------------+
| RFE/RL              |                 |             |
| (rfe,radiosvoboda,  |                 |             |
| ukra)               | 16    / 37      | 86348       |
+---------------------+-----------------+-------------+
| LiveOnlineRadio.Net |                 |             |
| (golosstolytsi)     | 5     / 10      | 27969       |
+---------------------+-----------------+-------------+
| Hromadske Radio     | 13    / 25      | 75181       |
+---------------------+-----------------+-------------+
| Radio Era           | 50    / 75      | 66281       |
+---------------------+-----------------+-------------+
| total               | 378   / 594     | 1192565     |
+---------------------+-----------------+-------------+

The number of total files (378) refers to the number of original recordings,
as recorded by the LDC.

Some of the broadcast news original recordings (and their transcripts) were
broken up into parts, in order to exclude commercial advertisements from the
release; the number of parts (594) refers to the total inventory of
transcript files.  (See section 2.2 above)

In a few cases -- 3 files from Radio of Ukraine / nrcu -- after the original
recording was broken up into parts, the initial part was found to contain no
Ukrainian speech data, and so has been left out of the corpus.  (These three
sets of "part" files have "_02", "_03" and "_04", but no "_01".)

All transcripts are delivered as *.tsv tab delimited files with the following
fields:

      - start timestamp
      - end timestamp
      - speaker ID (speaker1, speaker2, etc.)
      - speaker sex (male, female, unknown)
      - transcript text

These files do not have headers -- the first row of each file is transcript
data -- but not all rows are transcript segments: in many broadcast news
transcript files, there are rows to mark story boundaries; these rows have no
time stamps or speaker information, and contain just the token "<story>" in
column 5 (transcript text).
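
The layout described above can be read with a short Python sketch like the
one below. The names Row, read_transcript and is_story_boundary are
illustrative only. The sketch assumes that time stamps are decimal seconds
and that any row may have empty cells; neither assumption is confirmed by
this README beyond the field list above.

```python
import csv
from typing import Iterable, List, NamedTuple, Optional


class Row(NamedTuple):
    start: Optional[float]  # None on rows without time stamps, e.g. <story>
    end: Optional[float]
    speaker: str
    sex: str
    text: str


def _seconds(cell: str) -> Optional[float]:
    """Convert a time stamp cell to float; empty cells become None."""
    cell = cell.strip()
    return float(cell) if cell else None  # assumption: decimal seconds


def read_transcript(lines: Iterable[str]) -> List[Row]:
    """Parse headerless five-column transcript lines (e.g. an open file)."""
    rows = []
    # QUOTE_NONE keeps literal double quotes in the transcript text intact.
    for cells in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not cells:
            continue  # skip blank lines
        cells = (cells + [""] * 5)[:5]  # pad short rows to five columns
        start, end, speaker, sex, text = cells
        rows.append(Row(_seconds(start), _seconds(end), speaker, sex, text))
    return rows


def is_story_boundary(row: Row) -> bool:
    """True for rows that only mark the start of a news story."""
    return row.text.strip() == "<story>" and row.start is None
```

A file would typically be passed in as open(path, encoding="utf-8",
newline=""), where the UTF-8 encoding is likewise an assumption.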

3. Transcription mark-up

The following mark-up is used in the transcripts to indicate:
      %fp - filled pause
      %pw - partial word
      %noise - noise or speaker noise, such as a cough or sneeze
      <foreign> - non-Ukrainian speech of more than a few words
      (()) - unclear speech, best-guess transcription
      <story> - the beginning of a news story
      <commercial> - a segment containing an advertisement and/or music

For more information, see the transcription guidelines:
./docs/Ukrainian_Broadcast_Transcription_V3.1.pdf
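
For users who need plain word sequences, the following Python sketch shows
one way the mark-up listed above might be counted and stripped. It assumes
that "%" markers and angle-bracket tags stand as separate tokens and that
unclear speech is wrapped as "(( best guess ))"; the guidelines PDF is the
authority on the exact conventions. The function names are illustrative.

```python
import re
from collections import Counter

# Assumed token shapes: "%name" markers and "<name>" or "</name>" tags.
_MARKER = re.compile(r"%\w+|</?\w+>")
# Assumed shape for unclear speech: "(( best guess ))", possibly empty.
_UNCLEAR = re.compile(r"\(\(\s*(.*?)\s*\)\)")


def count_markers(text):
    """Count each mark-up token found in one transcript text cell."""
    return Counter(_MARKER.findall(text))


def plain_text(text):
    """Drop markers and tags, keep best-guess words, normalize spaces."""
    text = _UNCLEAR.sub(r" \1 ", text)  # keep the words inside (( ))
    text = _MARKER.sub(" ", text)       # remove %fp, %noise, <story>, ...
    return " ".join(text.split())
```

Whether best-guess words should be kept or discarded depends on the
application; the sketch keeps them.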


4.  Documentation

The following documents are present in the docs/ directory of this package:

4.1  audio_info.tab

Four-column tab-delimited table providing information about each audio file;
the columns are:

   1   filename  -- includes 2-digit segment number for edited files
   2   channel_count -- 1 or 2
   3   sample_rate  -- samples per second
   4   duration_sec

The table has a separate row for each edited/segmented broadcast news audio
file.
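
As a sketch of how this table might be used, the Python function below sums
the duration column per sample rate. The function name is illustrative, and
the code does not assume whether the table has a header row: any row whose
numeric columns fail to parse is skipped.

```python
import csv
from collections import defaultdict


def hours_by_sample_rate(lines):
    """Sum duration_sec (column 4) in hours, keyed by sample_rate (column 3)."""
    totals = defaultdict(float)
    for cells in csv.reader(lines, delimiter="\t"):
        if len(cells) < 4:
            continue  # blank or malformed row
        try:
            rate = int(cells[2])
            seconds = float(cells[3])
        except ValueError:
            continue  # e.g. a header row, if the table has one
        totals[rate] += seconds / 3600.0
    return dict(totals)
```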

4.2  audio_filelist.tab

Four-column tab-delimited table providing information about each original
recording; the columns are:

   1   filename  -- file-ID (minus 2-digit segment number in edited files)
   2   total_seconds -- summed over segments in edited files
   3   genre  -- "bcnews" or "cts"
   4   segmentation  -- "unedited" or list of segment numbers ("01,02...")

In the set of edited "bcnews" entries, the number of segments per file-ID
ranges from 1 (01) to 11.  In a few cases, a part number is followed by an
asterisk (e.g. "01*") to indicate that the corresponding audio "part" file has
been excluded from the corpus; this is because after the splitting was done,
one part was found to contain no Ukrainian speech (see section 2.4 above).
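
The segmentation column can be expanded into the list of part names expected
in the data directory, as in the following Python sketch. The function name
expand_segments is illustrative, and the sketch assumes that column 1 holds
the bare file-ID without a file extension.

```python
def expand_segments(file_id, segmentation):
    """Return the names of the parts that should be present for one row.

    Unedited recordings map to the file-ID itself; parts marked with a
    trailing asterisk were excluded from the corpus and are skipped.
    """
    if segmentation.strip() == "unedited":
        return [file_id]
    parts = []
    for item in segmentation.split(","):
        item = item.strip()
        if not item or item.endswith("*"):
            continue  # empty entry or excluded part
        parts.append(f"{file_id}_{item}")
    return parts
```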

4.3 audio_md5sums.txt

MD5 checksum for each audio file
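
The checksums can be verified with a short Python sketch such as the one
below. The layout of audio_md5sums.txt is not described in this README; the
sketch assumes the common md5sum layout of a hexadecimal digest followed by
whitespace and a file name, and the function names are illustrative.

```python
import hashlib


def md5_of_chunks(chunks):
    """Return the hex MD5 digest of an iterable of bytes objects."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large audio files stay off-heap."""
    with open(path, "rb") as handle:
        return md5_of_chunks(iter(lambda: handle.read(chunk_size), b""))


def parse_md5_line(line):
    """Split an assumed "<digest>  <name>" line into (digest, name)."""
    checksum, name = line.strip().split(None, 1)
    return checksum.lower(), name.lstrip("*")  # "*" marks binary mode
```

A file passes verification when md5_of_file(path) equals the digest parsed
from its line in the checksum list.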

4.4 transcript_stats.tab

Six-column table providing summary information for each transcript file, as
follows:

    1   filename -- file-ID minus the ".tsv" extension
    2   span   -- total audio duration covered by the transcript (seconds)
    3   segsec -- total sum of speaker segment durations (seconds)
    4   segs   -- total number of speaker segments
    5   tkns   -- total number of word tokens in segments
    6   spkrs  -- total number of distinct speakers

Note that the difference between "span" and "segsec" reflects both gaps
between speaker turns and overlapping speaker turns.
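
The relationship between the two columns can be illustrated with a small
Python sketch. It assumes that "span" runs from the earliest segment start to
the latest segment end, which this README does not state explicitly; under
that assumption, span minus segsec equals total gap time minus total overlap
time. The function names are illustrative.

```python
def span_and_segsec(segments):
    """segments: list of (start, end) pairs in seconds."""
    span = max(end for _, end in segments) - min(start for start, _ in segments)
    segsec = sum(end - start for start, end in segments)
    return span, segsec


def gaps_and_overlaps(segments):
    """Total silence between turns and total doubly-covered time."""
    gaps = overlaps = 0.0
    covered_until = None  # latest end time seen so far
    for start, end in sorted(segments):
        if covered_until is None:
            covered_until = end
            continue
        if start > covered_until:
            gaps += start - covered_until
        else:
            overlaps += min(end, covered_until) - start
        covered_until = max(covered_until, end)
    return gaps, overlaps
```

For the segments (0, 10), (8, 12) and (20, 25), the span is 25 and segsec is
19; the 8-second gap and the 2-second overlap account for the difference of 6.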

4.5 Ukrainian_Broadcast_Transcription_V3.1.pdf

Guidelines/specifications used in creating the transcripts in this package.

Note that although 'Broadcast' is specified in the title of the guidelines,
for consistency, these guidelines were also used to produce the CTS transcripts.

5.  Known Issues

5.1 Bad time stamps in some broadcast news transcript files

As indicated in the transcription guidelines, annotators were instructed to
mark regions that contained commercial advertisements.  For this release, the
"commercial" portions have been edited out from both the audio and the
transcript files for the subset of broadcast news recordings that contained
commercials.

While preparing and carrying out the programmatic removal of commercials, we
observed a particular pattern of error in some of the transcripts.

In regions where two or more speakers in the broadcast were talking
simultaneously, we sometimes found that one of the speaker turns in the region
had an end-time marking that yielded an implausible duration for the turn --
that is, although a given speaker may have said only a few words (while others
were talking), and the start time for this utterance was fairly accurate, the
end time was set to an offset position that was dozens or even hundreds of
seconds later.  (This offset often correlated with the position of the next
turn for the given speaker, and we sometimes observed that word tokens from
that later utterance were incorrectly included as part of the overlapping
speech segment.)

Among the files that underwent removal of commercials, most or all cases of
this problem were repaired.  But the same basic symptom was also found in
some other broadcast news files, and those cases have not yet been fixed.

Users are advised to be watchful for speech segments in the transcript files
where the beginning and end time stamps indicate an unusually long turn, but
the text content is unexpectedly brief.  As mentioned above, this condition
tends to show up in regions where two or more speakers overlap.
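
One way to look for such segments is sketched in Python below. The function
name and both thresholds (a minimum duration of 30 seconds and a maximum of
0.5 words per second) are arbitrary assumptions chosen for illustration and
should be tuned against the data; time stamps are assumed to be in seconds.

```python
def suspicious_segments(rows, min_seconds=30.0, max_words_per_second=0.5):
    """Flag long segments with very little text.

    rows: iterable of (start, end, text) with start/end in seconds or None.
    Returns a list of (start, end, word_count) for flagged segments.
    """
    flagged = []
    for start, end, text in rows:
        if start is None or end is None:
            continue  # e.g. <story> rows carry no time stamps
        if text.strip() in ("<story>", "<commercial>"):
            continue  # mark-up rows are not speaker turns
        duration = end - start
        if duration <= 0:
            continue  # nothing to measure
        words = len(text.split())
        if duration >= min_seconds and words / duration <= max_words_per_second:
            flagged.append((start, end, words))
    return flagged
```

Flagged segments are candidates for manual review rather than confirmed
errors, since slow or heavily interrupted speech can also trip the thresholds.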

5.2 Missing speaker data in some transcribed segments

There are 17 "voa_ukr" files that each contain a single transcribed segment that
lacks speaker information in columns 3 and 4; in each case, the utterance is
simply the phrase 'В ефірі "Голос Америки"' (the "Voice of America" intro).

Apart from that set, some transcribed speech regions have empty cells for
speaker gender and/or speaker-ID.

While many advertisement regions have been edited out, many regions remain
that are marked with the transcript text value "<commercial>"; these rows in
the transcripts have time stamps but no speaker information.
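
When processing transcripts, it can help to separate these cases explicitly.
The Python sketch below is one possible classification; the function name and
the category labels are illustrative assumptions, not part of the corpus.

```python
def classify_row(speaker, sex, text):
    """Label one transcript row by what it represents."""
    token = text.strip()
    if token == "<story>":
        return "story"
    if token == "<commercial>":
        return "commercial"
    if not speaker.strip() or not sex.strip():
        return "speech_missing_speaker"  # transcribed, but speaker info absent
    return "speech"
```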

5.3 Some transcript files have very little content

As indicated in the ./docs/transcript_stats.tab file, several transcript files
contain very few transcribed utterances, covering just a short time span in
the recording and having a very small token count.
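
Such files can be listed from transcript_stats.tab with a Python sketch like
the following. The function name and the threshold of 50 tokens are arbitrary
assumptions for illustration, and the table is assumed to be tab-delimited;
rows whose numeric columns fail to parse (such as a header, if present) are
skipped.

```python
import csv


def sparse_transcripts(lines, max_tokens=50):
    """Return (filename, span, tkns) for transcripts with few word tokens."""
    flagged = []
    for cells in csv.reader(lines, delimiter="\t"):
        if len(cells) < 6:
            continue  # blank or malformed row
        try:
            span = float(cells[1])
            tokens = int(cells[4])
        except ValueError:
            continue  # e.g. a header row, if the table has one
        if tokens <= max_tokens:
            flagged.append((cells[0], span, tokens))
    return flagged
```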

6.  Acknowledgements

This material is based upon work supported by the Defense Advanced
Research Projects Agency (DARPA) under Contract Nos. HR0011-15-C-0123 and
FA8750-18-C-0013. Any opinions, findings and conclusions or recommendations
expressed in this material are those of the author(s) and do not necessarily
reflect the views of DARPA.

7.  Copyright Information

Portions © 2017 Crimean Radio and Television Company, © 2017-2018 Hromadske Radio,
© 2017-2018 LiveOnlineRadio.Net, © 2017-2018 Radio of Ukraine, © 2017-2018 Radio Vesti,
© 2017-2018 RFE/RL, Inc., © 2022 Trustees of the University of Pennsylvania


8.  Contact Information

If you have questions about this data release, please contact the
following personnel at LDC.

Stephanie Strassel <strassel@ldc.upenn.edu> PI
Dana Delgado <foredana@ldc.upenn.edu> Project Manager
David Graff <graff@ldc.upenn.edu> Technical Lead

----
README created by Dana Delgado on March 7, 2022
README updated by Dana Delgado on March 16, 2022
README updated by Stephanie Strassel on March 21, 2022
README updated by Dana Delgado on March 23, 2022