GALE Phase 2 Arabic Broadcast News Transcripts - Part 2

Linguistic Data Consortium

Authors: Meghan Glenn, Haejoong Lee, Stephanie Strassel, Kazuaki Maeda

1 Introduction

This release comprises part 2 of the GALE P2 Arabic Broadcast News Transcripts. The transcripts included in this release were created by LDC to support the GALE Program sponsored by DARPA. An annotation tool, XTrans, was developed at LDC to support the transcription task. The corresponding audio data is released separately.

2 Data Sources

The broadcast news recordings used for transcription feature news broadcasts focusing principally on current events from the following sources: Abu Dhabi TV, based in Abu Dhabi, United Arab Emirates; Al Alam News Channel, based in Iran; Aljazeera, a regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national broadcast station in Jordan; Dubai TV, based in Dubai, United Arab Emirates; Al Iraqiyah, a television network based in Iraq; Kuwait TV, a national television station based in Kuwait; Lebanese Broadcasting Corporation, a Lebanese television station; Nile TV, a broadcast programmer based in Egypt; Saudi TV, a national television station based in Saudi Arabia; and Syria TV, the national television station in Syria.

3 Data Profile

Language      Data-type   Genre   Files   Tokens   Time (seconds)
-----------------------------------------------------------------
Arabic (MSA)  text        BN      204     920730   551396.97

BN stands for broadcast news. There may be overlap between BN and BC (broadcast conversation) content in a particular audio file. Our classification of a source program as BN or BC is meant to reflect the dominant genre. The token count is based on whitespace-delimited words.

4 Transcription Annotation

4.1 Annotation Process and Guidelines

Data go through one or more layers of annotation, depending on the type of transcription performed (described below). Regardless of the transcription method or data genre, data are either transcribed in-house or outsourced.
This is determined based on timeline, volume, and available resources. In either case, the following guidelines are followed consistently.

Quick transcripts (QTR): quick (near-)verbatim, time-aligned transcripts plus speaker ID with minimal additional markup; created by LDC and/or professional transcription agencies. These transcripts do not include SU (sentence-unit) annotations. Data are either fully outsourced for one pass of transcription, with no additional annotation performed, or fully transcribed in-house in one pass.

Quick rich transcripts (QRTR): quick (near-)verbatim, time-aligned transcripts with minimal markup, plus speaker ID and SU (sentence-unit) annotations; created by LDC and/or professional transcription agencies. First-pass annotation is either outsourced or completed in-house. A second (and in some cases a third) transcription pass is needed for this type of data, regardless of where the first pass was done. The second pass (and sometimes a very quick third pass) is always completed in-house by more senior annotators to ensure quality.

File names containing QTR indicate that the transcription was performed in the quick transcription style, while file names containing QRTR indicate the rich, careful transcription style.

Copies of the transcription guidelines are included in the docs directory. Detailed QTR and QRTR transcription guidelines are also provided on LDC's GALE website at:

http://www.ldc.upenn.edu/Projects/GALE/Transcription

4.2 Annotation Tool

Transcription annotation is done with XTrans, a next-generation multi-platform, multilingual, multi-channel transcription tool developed by LDC to support manual transcription and annotation of audio recordings.
Designed with input from experienced human transcribers working with real-world data, XTrans provides a flexible and intuitive graphical user interface for a multitude of speech annotation tasks, including (virtual) segmentation of audio into smaller units such as turns and sentences; speaker identification; orthographic transcription in any language; and labeling of structural elements of the transcript, such as topics. The tool is freely available for download at:

http://www.ldc.upenn.edu/tools/XTrans/downloads/

5 Data format

5.1 Transcription file name conventions

Transcription files are named as follows.

__ARB__
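
Section 4.1 states that file names containing QTR or QRTR indicate the transcription style. The following is a minimal Python sketch of how a user might read that tag from a file name. It assumes only that the tag appears as a separate field delimited by non-alphanumeric characters such as "_" or "."; the function name and the example file names are hypothetical and are not taken from this release.

```python
import re

# Assumption: the style tag (QTR or QRTR) appears in the file name as its
# own field, delimited by non-alphanumeric characters such as "_" or ".".
_STYLE_TAG = re.compile(
    r"(?<![A-Za-z0-9])(QRTR|QTR)(?![A-Za-z0-9])", re.IGNORECASE
)


def transcription_style(filename):
    """Return "QRTR", "QTR", or None if no style tag is found."""
    match = _STYLE_TAG.search(filename)
    return match.group(1).upper() if match else None


# Hypothetical file names, for illustration only.
examples = [
    "SOURCE_PROGRAM_ARB_20070101_120000.qrtr.tdf",
    "SOURCE_PROGRAM_ARB_20070101_120000.qtr.tdf",
    "notes.txt",
]
for name in examples:
    print(name, "->", transcription_style(name))
```

Matching the tag as a whole field, rather than as a bare substring, avoids false positives on names that merely contain those letters inside a longer word.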