AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts
Item Name: | AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts |
Author(s): | Dana Delgado, Kevin Walker, David Graff, Stephanie Strassel |
LDC Catalog No.: | LDC2023S01 |
ISLRN: | 699-485-644-732-3 |
DOI: | https://doi.org/10.35111/qge4-4f15 |
Release Date: | January 17, 2023 |
Member Year(s): | 2023 |
DCMI Type(s): | Sound, Text |
Sample Type: | CTS 8KHz 16-bit pcm, BN 16KHz or 22KHz 16-bit pcm |
Sample Rate: | CTS 8KHz 16-bit pcm, BN 16KHz or 22KHz 16-bit pcm |
Data Source(s): | broadcast news, telephone conversations |
Project(s): | AIDA, LORELEI, NIST LRE |
Application(s): | speech recognition |
Language(s): | Ukrainian |
Language ID(s): | ukr |
License(s): | LDC User Agreement for Non-Members |
Online Documentation: | LDC2023S01 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Delgado, Dana, et al. AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts LDC2023S01. Web Download. Philadelphia: Linguistic Data Consortium, 2023. |
Introduction
AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts was developed by the Linguistic Data Consortium (LDC) and consists of approximately 156 hours of Ukrainian conversational telephone speech (CTS) and broadcast news (BN) audio with 1.2 million words of corresponding orthographic transcripts.
The broadcast recordings and transcripts were produced to support the DARPA AIDA (Active Interpretation of Disparate Alternatives) program, which aimed to develop a multi-hypothesis semantic engine to generate explicit alternative interpretations of events, situations and trends from a variety of unstructured sources. LDC supported AIDA by collecting, creating and annotating multimodal linguistic resources in multiple languages.
The telephone speech audio recordings were collected to support the NIST 2011 Language Recognition Evaluation, which focused on pair discrimination for 24 languages/dialects. These recordings are also contained in Multi-Language Conversational Telephone Speech 2011 – Slavic Group LDC2016S11. The goal of NIST's LRE series is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field.
Data
The CTS audio data was generated from telephone calls made by native Ukrainian speakers to acquaintances in their social networks. It was collected using LDC's telephone infrastructure, which comprises three computer telephony systems. Human auditors labeled calls for callee gender, dialect type and noise. All CTS audio files were originally collected as 2-channel u-law and were converted to 8KHz 16-bit pcm and flac compressed for release.
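The u-law to 16-bit pcm step described above can be illustrated with a minimal Python sketch. It is not LDC's conversion tooling: it applies the standard G.711 u-law expansion and assumes, purely for illustration, that the two channels are byte-interleaved in a raw buffer. The function names ulaw_byte_to_pcm16 and decode_stereo_ulaw are hypothetical.

```python
BIAS = 0x84  # bias constant used by the standard G.711 u-law expansion


def ulaw_byte_to_pcm16(byte):
    """Expand one 8-bit u-law code into a signed 16-bit linear sample."""
    u = ~byte & 0xFF              # u-law codes are stored bit-inverted
    sign = u & 0x80               # top bit carries the sign
    exponent = (u >> 4) & 0x07    # 3-bit segment number
    mantissa = u & 0x0F           # 4-bit step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def decode_stereo_ulaw(data):
    """Split a raw u-law buffer into two channels of 16-bit samples.

    Assumption: the channels are byte-interleaved (A, B, A, B, ...).
    The source does not describe the original file layout.
    """
    channel_a = [ulaw_byte_to_pcm16(b) for b in data[0::2]]
    channel_b = [ulaw_byte_to_pcm16(b) for b in data[1::2]]
    return channel_a, channel_b


# Four bytes of made-up input: two samples per channel.
channel_a, channel_b = decode_stereo_ulaw(bytes([0xFF, 0x00, 0x80, 0xFF]))
```

The released files are already 16-bit pcm in flac containers, so a decoder like this would matter only when working with raw u-law captures.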
The BN data was taken from 87 news recordings broadcast by various Ukrainian sources. All BN audio files were originally collected as mp3, either via web download or as captures of live streaming broadcasts, and were downsampled to either 16KHz or 22KHz 16-bit pcm and flac compressed for release.
Native Ukrainian speakers manually segmented the data into sentence-level units as part of the transcription process. All transcripts are delivered as tab-delimited *.tsv files that include metadata and statistics.
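As a rough illustration of working with tab-delimited transcripts, the Python sketch below reads segments with the standard csv module and counts whitespace-separated words. The column names (start, end, transcript) and the sample rows are hypothetical, since the actual column layout is defined in the corpus documentation and not on this page.

```python
import csv
import io


def read_segments(handle):
    """Read tab-delimited rows into a list of dicts keyed by the header line."""
    # QUOTE_NONE keeps any quotation marks in the transcript text literal.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def count_words(segments, text_column="transcript"):
    """Count whitespace-separated tokens in the given text column.

    Assumption: "transcript" is a placeholder column name, not the
    corpus's documented header.
    """
    return sum(len(row[text_column].split()) for row in segments)


# Made-up sample data standing in for one transcript file.
sample = (
    "start\tend\ttranscript\n"
    "0.00\t2.10\tдобрий день\n"
    "2.10\t4.50\tяк у вас справи\n"
)
segments = read_segments(io.StringIO(sample))
total_words = count_words(segments)
```

For a real file, the same functions could be called on a handle opened with open(path, encoding="utf-8", newline=""), after adjusting text_column to match the documented header.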
Sponsorship
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract Nos. HR0011-15-C-0123 and FA8750-18-C-0013. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA.
Updates
None at this time.