AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts

Item Name: AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts
Author(s): Dana Delgado, Kevin Walker, David Graff, Stephanie Strassel
LDC Catalog No.: LDC2023S01
ISLRN: 699-485-644-732-3
DOI: https://doi.org/10.35111/qge4-4f15
Release Date: January 17, 2023
Member Year(s): 2023
DCMI Type(s): Sound, Text
Sample Type: CTS 8KHz 16-bit pcm, BN 16KHz or 22KHz 16-bit pcm
Sample Rate: CTS 8KHz 16-bit pcm, BN 16KHz or 22KHz 16-bit pcm
Data Source(s): broadcast news, telephone conversations
Project(s): AIDA, LORELEI, NIST LRE
Application(s): speech recognition
Language(s): Ukrainian
Language ID(s): ukr
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2023S01 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Delgado, Dana, et al. AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts LDC2023S01. Web Download. Philadelphia: Linguistic Data Consortium, 2023.

Introduction

AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts was developed by the Linguistic Data Consortium (LDC) and contains approximately 156 hours of Ukrainian conversational telephone speech (CTS) and broadcast news (BN) audio with 1.2 million words of corresponding orthographic transcripts.

The broadcast recordings and transcripts were produced to support the DARPA AIDA (Active Interpretation of Disparate Alternatives) program, which aimed to develop a multi-hypothesis semantic engine to generate explicit alternative interpretations of events, situations and trends from a variety of unstructured sources. LDC supported AIDA by collecting, creating and annotating multimodal linguistic resources in multiple languages.

The telephone speech audio recordings were collected to support the NIST 2011 Language Recognition Evaluation, which focused on pair discrimination for 24 languages/dialects. These recordings are also contained in Multi-Language Conversational Telephone Speech 2011 – Slavic Group LDC2016S11. The goal of NIST’s LRE series is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field.

Data

The CTS audio data was generated from telephone calls made by native Ukrainian speakers to acquaintances in their social networks. It was collected using LDC's telephone infrastructure, which consists of three computer telephony systems. Human auditors labeled calls for callee gender, dialect type and noise. All CTS audio files were originally collected as 2-channel u-law and were converted to 8KHz 16-bit pcm and flac compressed for release.
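
For readers unfamiliar with the u-law to 16-bit pcm step mentioned above, the following is a minimal sketch of standard G.711 u-law expansion. It is not LDC's conversion tooling, which this description does not specify; the function names are hypothetical, and the input is assumed to be raw 8-bit u-law bytes.

```python
import struct

def ulaw_byte_to_pcm16(code: int) -> int:
    """Expand one 8-bit G.711 u-law code to a signed 16-bit linear sample."""
    u = ~code & 0xFF                 # u-law bytes are stored bit-inverted
    sign = u & 0x80                  # top bit: sign
    exponent = (u >> 4) & 0x07       # next three bits: segment number
    mantissa = u & 0x0F              # low four bits: step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84  # 0x84 is the u-law bias
    return -magnitude if sign else magnitude

def ulaw_to_pcm16_bytes(data: bytes) -> bytes:
    """Convert a buffer of u-law bytes to little-endian 16-bit pcm bytes."""
    samples = [ulaw_byte_to_pcm16(b) for b in data]
    return struct.pack("<%dh" % len(samples), *samples)
```

The released files are already 16-bit pcm in flac containers, so this step is not needed to use the corpus; the sketch only illustrates what the conversion involves.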

The BN data was taken from 87 news recordings broadcast by various Ukrainian sources. All BN audio files were originally collected as mp3 via web-download or as live streaming broadcast captures and were downsampled to either 16KHz or 22KHz 16-bit pcm and flac compressed for release.

Native Ukrainian speakers manually segmented the data into sentence-level units as part of the transcription process. All transcripts are delivered as tab-delimited *.tsv files that include metadata and statistics.
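
As a rough illustration of reading tab-delimited transcripts, the sketch below parses rows with Python's standard csv module. The column layout is not described here, so the sample rows (file ID, start time, end time, text) and the choice of the last field as the transcript text are assumptions made for illustration only; consult the corpus documentation for the actual columns.

```python
import csv
import io

def read_tsv_rows(handle):
    """Yield each non-empty line of a tab-delimited file as a list of fields."""
    # QUOTE_NONE keeps quotation marks in transcript text as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            yield row

# Hypothetical sample rows, not taken from the corpus.
sample = "file_a\t0.00\t2.35\tдобрий день\nfile_a\t2.35\t4.10\tяк справи\n"

rows = list(read_tsv_rows(io.StringIO(sample)))

# Assumes the transcript text is the last field of each row.
word_count = sum(len(row[-1].split()) for row in rows)
```

For a file on disk, the same function can be called on a handle opened with `open(path, encoding="utf-8", newline="")`.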

Sponsorship

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract Nos. HR0011-15-C-0123 and FA8750-18-C-0013. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA.

Samples

Please view these samples:

Updates

None at this time.
