README FILE FOR LDC CATALOG ID: LDC2025S05

TITLE: IWSLT22 and IWSLT23 Tunisian Arabic Shared Task Training,
       Development and Test Data

AUTHORS: Michael Arrigo, Dana Delgado, Stephanie Strassel, David Graff

1.0 Introduction

This release contains data used for system training, development and
testing in the International Conference on Spoken Language Translation
(IWSLT) 2022 Dialectal Speech Translation Task and the 2023 Dialectal
and Low-Resource Speech Translation Task. Data in this release is split
into five partitions, each containing audio, transcripts and
translations. The train, dev and test1 partitions were used by
participants in both IWSLT22 and IWSLT23 to build and develop their
systems, while the test2 and test3 partitions are the official
evaluation data for IWSLT22 and IWSLT23, respectively.

The data comprises Tunisian Arabic (TA) audio and transcriptions of
that audio, along with translations into English. The TA audio in this
release consists of conversational telephone speech (CTS) recordings
collected from consented human subjects in Tunis, via a robot operator
system which recorded digital sample data directly from the regional
public telephone network. The transcripts are presented as
tab-delimited tables containing UTF-8-encoded Arabic script, and
translations are presented in parallel fashion (including some
UTF-8-encoded non-ASCII characters, such as accented vowels).

The TA CTS audio, transcripts and translations are stored as pairs of
single-channel files representing the two sides ("A" and "B") of each
conversation. Channel/side A is a "claque" speaker (someone recruited
to make calls) and channel/side B is a "callee" (a conversation partner
of the claque's choosing who also consented to be recorded).

In total, this release contains 210.08 hours of 2-channel CTS audio
recordings in 1188 conversations, with transcripts covering nearly 175
hours of time-stamped speech segments and English translations for all
transcripts.

2.0 Directory Structure

The directory structure and contents of the package are summarized
below -- paths shown are relative to the base (root) directory of the
package:

  ./README.txt -- this file
  ./data/
      dev/audio
      dev/transcripts
      dev/translations
      test1/audio
      test1/transcripts
      test1/translations
      test2/audio
      test2/transcripts
      test2/translations
      test3/audio
      test3/transcripts
      test3/translations
      train/audio
      train/transcripts
      train/translations
  ./docs/
      callid_audiofile_subjectid_map.tab
      CMN2_Tunisian_Arabic_CTS_Transcription_Appendix_Function_Words_V5.0.pdf
      CMN2_Tunisian_Arabic_CTS_Transcription_Guidelines_V6.0.pdf
      cts_subjects.tab
      cts_token-initial_diacritics.tab
      partitions.tab
      Tunisian_Arabic_CTS_Translation_Guidelines_V1.1.pdf

3.0 Contents

3.1 Audio

The files in ./data/{dev|test1|test2|test3|train}/audio/ are
FLAC-compressed MS-WAV files containing 16-bit PCM data originally
recorded at 8000 samples/sec.

3.2 Transcription

The table below shows the number of segments, hours, and files provided
in this release:

  +-----------+---------+---------+---------+
  | type      | n_segs  | n_hours | n_files |
  +-----------+---------+---------+---------+
  | TA / CTS  | 219,481 |  174.58 |   2,376 |
  +-----------+---------+---------+---------+
  | total     | 219,481 |  174.58 |   2,376 |
  +-----------+---------+---------+---------+

Note that the durations above are based on the speech segments alone
and do not include durations for nonspeech segments.

3.3 Translation

The table below shows the number of segments, hours, and files for the
translation files in this release:

  +-----------+---------+---------+---------+
  | type      | n_segs  | n_hours | n_files |
  +-----------+---------+---------+---------+
  | TA / CTS  | 219,481 |  174.58 |   2,376 |
  +-----------+---------+---------+---------+
  | total     | 219,481 |  174.58 |   2,376 |
  +-----------+---------+---------+---------+

Note that the durations above are based on the speech segments alone
and do not include durations for nonspeech segments.
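
As an illustration of the directory layout in section 2.0, the sketch
below builds the three paths that belong to one call side (audio,
transcript, translation). It is not part of the release tooling: the
function name side_paths and the constant PARTITIONS are hypothetical,
and the .flac and .tsv extensions are assumed from the example file
names given elsewhere in this README.

```python
from pathlib import PurePosixPath

# Partition names as listed in the directory structure of this package.
PARTITIONS = ("train", "dev", "test1", "test2", "test3")


def side_paths(root, partition, stem):
    """Return the audio, transcript and translation paths for one call side.

    `stem` is the file name without extension (hypothetical helper;
    the extensions below are assumed, not checked against the package).
    """
    if partition not in PARTITIONS:
        raise ValueError(f"unknown partition: {partition!r}")
    base = PurePosixPath(root) / "data" / partition
    return {
        "audio": base / "audio" / f"{stem}.flac",
        "transcript": base / "transcripts" / f"{stem}.tsv",
        "translation": base / "translations" / f"{stem}.tsv",
    }
```

For example, side_paths(".", "train", "20161227_230341_13205_A")
returns the paths for side A of one training call, relative to the
package root. PurePosixPath is used so that the sketch only builds
path strings and never touches the file system.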

3.4 Documentation

The ./docs/callid_audiofile_subjectid_map.tab file maps each call ID to
the corresponding audio file and to the subject ID of the speaker in
that audio file. The subject ID of the callee (B) side of each call is
the call ID prefixed with "9".

The ./docs/cts_subjects.tab file lists the subject ID, sex, year of
birth, and native language of each claque (A) side included in this
release. The file does not include this information for the callee (B)
side of any call since it was not gathered during the collection of the
corpus.

The contents of ./docs/cts_token-initial_diacritics.tab are explained
in section 5.3 (Transcription - Known Issues). Other files explain the
specifications and guidelines used by transcribers and translators.
The directory also contains a partitions.tab file with information
about which files were designated as train, dev, test1, test2 or test3
for IWSLT22/23.

4.0 Audio Segmentation

Audio segmentation was carried out in three steps. First, the original
single-channel audio files were automatically segmented using speech
activity detection (SAD) at min_nonspch=0.5 seconds, and the resulting
output files for the corresponding A and B channels were merged and
ordered by onset time. Then, segments with a duration of 15 seconds or
more were passed through SAD a second time, but this time at
min_nonspch=0.3 seconds to ensure that no segments were longer than 15
seconds. Finally, the data was manually corrected by experienced
transcribers in XTrans to yield transcription-ready segments.

5.0 Transcription

For CTS data, segments from a call were presented to transcribers via a
web interface one segment at a time and chronologically, alternating
between speaker A and speaker B, for the whole call. The transcribers
did not have access to the full waveform, but they did have access to
adjacent segments in either of the two call sides if more context was
needed. The transcribers had the option to reject a segment if it was
bad (e.g.
it wasn't Tunisian Arabic speech, the boundaries were incorrect, etc.).

For CTS segments that were not rejected, transcribers provided an
Arabic orthographic transcript typed using Buckwalter transliteration.
For a subset of the data, a broad phonemic IPA transcript was added.
There was also a verification pass for each segment so that the
transcriber could confirm accuracy and mark individual tokens as MSA
(Modern Standard Arabic), foreign, or uncertain. For two-layer
transcription, the verification pass also ensured that the Arabic
orthographic transcript and the IPA transcript contained the same
number of tokens.

5.1 Transcription and Translation File Format

This release includes one Arabic orthographic transcript file per audio
file, plus an associated English translation file. For example, for the
audio file:

  ./data/train/audio/20161227_230341_13205_A.flac

there are two corresponding text files:

  ./data/train/transcripts/20161227_230341_13205_A.tsv
  ./data/train/translations/20161227_230341_13205_A.tsv

Each text file contains four columns with the following information for
each segment:

  (1) start time
  (2) end time
  (3) subject ID
  (4) transcript text (in UTF-8 Arabic or English)

The CTS Arabic transcripts contain token-level markup if the
transcriber flagged the token as "MSA", "foreign", or "uncertain". The
token flagging involved prefixing one of the following to the affected
tokens:

  M/   MSA
  O/   foreign ("other language")
  U/   uncertain
  UM/  uncertain + MSA
  UO/  uncertain + foreign

SPECIAL NOTE ABOUT DISPLAYING UTF-8 ARABIC TRANSCRIPT TEXT:

In some applications that support Arabic text display, the default
algorithm for bi-directional text (where a single line contains both
right-to-left and left-to-right characters) may cause the ordering of
word tokens within a segment to become scrambled when the various token
flags and tag strings are present.
In order to ensure that Arabic words in a segment are always presented
in the correct right-to-left sequence in all applications (text
editors, terminal emulators, etc.), each full-segment text string
(column 4 of each row) should be bracketed with Unicode directionality
control characters, as follows:

  - prepend U+202B (RIGHT-TO-LEFT EMBEDDING) as the first character
  - append U+202C (POP DIRECTIONAL FORMATTING) as the last character

In the context of presenting data via HTML markup, there is also the
alternative method of including the attribute ' dir="rtl" ' on whatever
HTML tag is used to contain the Arabic segment. (In this case, the tag
should hold only the segment and no additional, non-Arabic text.)

5.2 CTS Transcription Quality Control

After transcription, an additional quality control pass was conducted
on the corpus. This quality control effort was focused on (1)
identifying and correcting common transcription errors that affected a
large number of tokens and (2) reducing the number of low-frequency
tokens that were clearly transcription errors.

The first type of transcription error was handled systematically by
carrying out global token or character replacements in the corpus.
Some common error types were also reviewed manually when necessary. At
this stage, care was taken to identify, manually review, and remove any
non-Buckwalter characters from the Buckwalter transcripts (with the
exception of some punctuation marks, e.g. "!", "?", etc.).

The second type of transcription error was addressed by a quality
control task made available to the transcription team. All unigrams
with five or fewer occurrences were clustered by stripping punctuation
marks, diacritic letters, and suffixes and grouping the resulting
tokens that were identical. These clusters were presented to the
transcribers along with the segment text for each token and a link to
the segment (with audio) in the transcription tool.
Transcribers provided the correct token form for each cluster. Where
multiple correct forms were needed for a single cluster, transcribers
marked which tokens were to be replaced with which correct token form.
The token replacements were then applied within the specified segment.

5.3 Transcription - Known Issues

5.3.1 Stranded diacritic marks

The Arabic orthography transcripts contain 158 instances (in 134
transcript files) of token-initial diacritic marks, sometimes
co-occurring with a token flag prefix (e.g. "O/"). In the Buckwalter
files, the corresponding tokens begin with one of the characters
"a i o u ~". These are residual transcription errors: either an
initial Buckwalter "a" should have been "A" (or some form of "alef"
with diacritic mark), or else an initial consonant letter was omitted.
In the Arabic files, these initial diacritic marks will "attach" to the
preceding space or "/" character. A list of the affected transcript
files, with the number of affected segments per file, is provided in
./docs/cts_token-initial_diacritics.tab.

5.3.2 Some empty segments in CTS Arabic transcript files

In some CTS transcript files, one or more of the segments contain only
a hyphen character ("-") as the full content of the transcript field.

6.0 Translation

For CTS translations, the full Arabic orthographic transcript from a
call (both A and B call sides) was presented as the source text to
translators as an interleaved conversation. The translators were
presented with speaker IDs and timestamps for each segment to aid the
readability of the conversation. Translators did not have access to
the audio files. Transcription mark-up was converted to standard
translation-style mark-up in the data presented to translators in order
to promote consistent use of mark-up in the translations and to
increase readability and translation efficiency.
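
The interleaved view described above can be approximated from the
per-side files in this release. The sketch below is only an
illustration, not the tool used at LDC: the function names
read_segments and interleave are hypothetical, and it assumes that the
start and end times in columns 1 and 2 are decimal numbers, that files
have no header row, and that ordering by start time is an acceptable
stand-in for the presentation order seen by translators.

```python
def read_segments(tsv_text):
    """Parse the text of one 4-column file into a list of tuples.

    Each tuple is (start, end, subject_id, text). Times are assumed to
    be decimal numbers; rows without four tab-separated fields are
    skipped.
    """
    segments = []
    for line in tsv_text.split("\n"):
        line = line.rstrip("\r")
        if not line.strip():
            continue
        # Split on the first three tabs only, so the text field is kept whole.
        fields = line.split("\t", 3)
        if len(fields) != 4:
            continue
        start, end, subject_id, text = fields
        segments.append((float(start), float(end), subject_id, text))
    return segments


def interleave(side_a, side_b):
    """Merge the segments of both call sides, ordered by start then end time."""
    return sorted(side_a + side_b, key=lambda seg: (seg[0], seg[1]))


# Made-up sample rows (not taken from the corpus) for the two sides of a call.
side_a_text = "0.50\t1.75\t1001\thello\n3.00\t4.00\t1001\tbye\n"
side_b_text = "1.80\t2.90\t9002\thi there\n"

conversation = interleave(read_segments(side_a_text), read_segments(side_b_text))
for start, end, subject_id, text in conversation:
    print(f"{start:8.2f} {end:8.2f} {subject_id}\t{text}")
```

In practice the two input strings would be read from the A and B files
of the same call, opened with encoding="utf-8".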

6.1 CTS Translation Mark-up

The following mark-up is used in the English translations:

  (())              - Uncertain word or words
  %pw               - Partial word
  #                 - Foreign word, either followed by translation or
                      (()) if it cannot be translated
  +                 - Mispronounced word (carried over from
                      mispronunciation marked in transcript)
  uh, um, eh or ah  - Filled pauses
  =                 - Typographical error from transcript

Detailed guidelines for translation can be found in
Tunisian_Arabic_CTS_Translation_Guidelines_V1.1.pdf in the docs/
directory.

6.2 CTS Translation Quality Control

All translations were automatically checked for completeness and
consistent use of mark-up. Segments found to have an above- or
below-average expansion rate from Arabic into English were flagged and
manually reviewed and corrected by translators. Additional automated
checks and normalizations were applied to the data in this delivery.

7.0 Acknowledgments

The authors acknowledge the following contributors to this data set:

  Mohamed Maamouri (LDC)
  Alexander Shelmire (LDC)
  Jonathan Wright (LDC)
  Joshua Parry (LDC)
  Christopher Caruso (LDC)
  Kevin Duh (JHU)

8.0 Copyright Information

(c) 2024 Trustees of the University of Pennsylvania

9.0 Contacts

For further information about this data release contact the following
project staff at LDC:

  Stephanie Strassel - PI

----------------------
README created by Dana Delgado, David Graff, Stephanie Strassel
October 14, 2024