Introduction

This corpus consists of Voice of America radio broadcasts in Turkish and is part of a larger corpus of Turkish broadcast news data collected and transcribed for research purposes. The main goal is to facilitate research in Turkish automatic speech recognition and its applications, such as speech retrieval. The data was collected using a simple PC with a TV/radio card, and a quick manual segmentation and transcription approach was followed.

Speech recognition and retrieval experiments using a part of the larger corpus can be found in the following journal article:
Ebru Arisoy, Dogan Can, Siddika Parlak, Hasim Sak, and Murat Saraclar, "Turkish Broadcast News Transcription and Retrieval,"
IEEE Transactions on Audio, Speech and Language Processing, 17(5):874-883, July 2009.

For more information please visit http://busim.ee.boun.edu.tr/~speech or contact the principal investigator, Murat Saraçlar.

Data

The audio data was collected between December 2006 and June 2009.
The data collected between 2006 and 2008 comes from analog FM radio, whereas the data collected in 2009 comes from digital satellite transmission.

The source data is broadcast news recorded from radio using a TV/radio card. The audio was recorded at 32 kHz and resampled to 16 kHz. After screening for recording quality, the recordings were segmented, transcribed, and verified. Segmentation followed a two-step procedure in which an initial automatic segmentation was followed by manual correction and annotation. Information such as background conditions and speaker boundaries was also added in this manual step.

There are a total of 254 audio files, and each of them is 30 minutes long. The total amount of transcribed speech included in the corpus is approximately 107 hours.
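
The file count and file length above imply a total recording time that can be compared with the amount of transcribed speech. The short Python sketch below does that arithmetic using only the figures stated in this section; the variable names are illustrative and are not part of the corpus.

```python
# Figures taken from the corpus description above.
NUM_FILES = 254          # number of audio files
MINUTES_PER_FILE = 30    # length of each file, in minutes
TRANSCRIBED_HOURS = 107  # approximate amount of transcribed speech, in hours

# Total recorded audio, in hours.
total_audio_hours = NUM_FILES * MINUTES_PER_FILE / 60

# Share of the recorded audio that is counted as transcribed speech.
transcribed_share = TRANSCRIBED_HOURS / total_audio_hours

print(f"Total recorded audio: {total_audio_hours:.0f} hours")
print(f"Transcribed share: {transcribed_share:.0%}")
```

With these figures the recordings total 127 hours, of which roughly 84% is counted as transcribed speech.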

The transcription guidelines were adapted from the LDC Hub4 and quick transcription guidelines. An English version of the guidelines is provided with the data. The manual segmentations and transcripts were created by native Turkish speakers at Boğaziçi University using the Transcriber software (http://trans.sourceforge.net).
The transcriptions are provided in the ISO-8859-9 (Latin5) character set.
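
Because the transcriptions use ISO-8859-9 rather than UTF-8, tools that assume UTF-8 may garble Turkish letters such as ğ, ı, İ and ş. The following is a minimal Python sketch of reading a transcript and re-encoding it as UTF-8. The helper names and any file paths passed to them are hypothetical; only the character set comes from this document.

```python
from pathlib import Path

# Python's standard codec name for ISO-8859-9 (Latin5).
LATIN5 = "iso8859_9"


def read_latin5(path):
    """Return the contents of a Latin5-encoded file as a Python str."""
    return Path(path).read_bytes().decode(LATIN5)


def convert_to_utf8(src, dst):
    """Write a UTF-8 copy of a Latin5-encoded file (hypothetical helper).

    The source file is left untouched, and line endings are preserved
    because the data is handled as bytes.
    """
    Path(dst).write_bytes(read_latin5(src).encode("utf-8"))
```

Working on bytes and decoding explicitly avoids relying on the platform's default text encoding, which differs between systems.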

Sponsorship

Funding for this corpus collection effort came from TÜBİTAK Project 105E102 and Boğaziçi University Research Fund Project 05HA202.

Acknowledgments

PhD Students: Ebru Arısoy, Haşim Sak 
MS Students: Ç. Kayra Akman, Tuncay Aksungurlu, Doğan Can, Erinç Dikici, Sıddıka Parlak, Temuçin Som

Data collection and organization: İpek Şen
Segmentation: Esra Çınar, ...
Transcription: Nihan Sefer, Filiz Carus, Burcu Güre, Zeynep Arıkan, Işıl Aracı, ...

The automatic segmentation software was provided by SESTEK (http://www.sestek.com.tr).