README FILE FOR LDC CATALOG ID: LDC2025S05

TITLE: IWSLT22 and IWSLT23 Tunisian Arabic Shared Task Training,
Development and Test Data

AUTHORS: Michael Arrigo, Dana Delgado, Stephanie Strassel, David Graff

1.0 Introduction

This release contains data used for system training, development and
testing in the International Conference on Spoken Language Translation
(IWSLT) 2022 Dialectal Speech Translation Task and the 2023 Dialectal
and Low-Resource Speech Translation Task. Data in this release is
split into five partitions, each containing audio, transcripts and
translations. The train, dev and test1 partitions were used by
participants in both IWSLT22 and IWSLT23 to build and develop their
systems, while the test2 and test3 partitions are the official
evaluation data for IWSLT22 and IWSLT23, respectively.

The data comprises Tunisian Arabic (TA) audio and its transcriptions,
along with translations into English. The TA audio in this
release consists of conversational telephone speech (CTS) recordings
collected from consented human subjects in Tunis, via a robot operator
system which recorded digital sample data directly from the regional
public telephone network.

The transcripts are presented as tab-delimited tables containing
UTF-8 encoded Arabic script, and translations are presented in parallel
fashion (including some UTF-8 encoded non-ASCII characters, such as
accented vowels). The TA CTS audio, transcripts and translations are
stored as pairs of single-channel files representing the two sides
("A" and "B") of each conversation. Channel/side A is a "claque"
speaker (someone recruited to make calls) and channel/side B is a
"callee" (a conversation partner of the claque's choosing who also
consented to be recorded).

In total, this release contains 210.08 hours of 2-channel CTS audio
recordings in 1188 conversations. Transcripts cover nearly 175 hours
of time-stamped speech segments, and English translations are provided
for all transcripts.

2.0 Directory Structure

The directory structure and contents of the package are summarized
below -- paths shown are relative to the base (root) directory of the
package:

 ./README.txt -- this file

 ./data/
        ./dev/audio
        ./dev/transcripts
        ./dev/translations

        ./test1/audio
        ./test1/transcripts
        ./test1/translations

        ./test2/audio
        ./test2/transcripts
        ./test2/translations

        ./test3/audio
        ./test3/transcripts
        ./test3/translations

        ./train/audio
        ./train/transcripts
        ./train/translations

 ./docs/
       callid_audiofile_subjectid_map.tab
       CMN2_Tunisian_Arabic_CTS_Transcription_Appendix_Function_Words_V5.0.pdf
       CMN2_Tunisian_Arabic_CTS_Transcription_Guidelines_V6.0.pdf
       cts_subjects.tab
       cts_token-initial_diacritics.tab
       partitions.tab
       Tunisian_Arabic_CTS_Translation_Guidelines_V1.1.pdf

3.0 Contents

3.1 Audio

The files in ./data/{dev|test1|test2|test3|train}/audio/ are FLAC-compressed 
MS-WAV files containing 16-bit PCM data originally recorded at 8000 samples/sec.

3.2 Transcription

The table below shows the number of segments, hours, and files
provided in this release:

+-----------+---------+---------+---------+
| type      | n_segs  | n_hours | n_files |
+-----------+---------+---------+---------+
| TA / CTS  | 219,481 |  174.58 |  2,376  |
+-----------+---------+---------+---------+
| total     | 219,481 |  174.58 |  2,376  |
+-----------+---------+---------+---------+

Note that the durations above are based on the speech segments alone
and do not include durations for nonspeech segments.


3.3 Translation

The table below shows the number of segments, hours, and files
for the translation files in this release:

+-----------+---------+---------+---------+
| type      | n_segs  | n_hours | n_files |
+-----------+---------+---------+---------+
| TA / CTS  | 219,481 |  174.58 |  2,376  |
+-----------+---------+---------+---------+
| total     | 219,481 |  174.58 |  2,376  |
+-----------+---------+---------+---------+

Note that the durations above are based on the speech segments alone
and do not include durations for nonspeech segments.

3.4 Documentation

The ./docs/callid_audiofile_subjectid_map.tab file contains a mapping of the
call ID, the corresponding audio file, and the subject ID of the speaker for
each audio file in this release.  The subject ID of the callee (B) side for
each call is formed by prefixing the call ID with "9".

The ./docs/cts_subjects.tab file contains a mapping of the subject ID, sex,
year of birth, and native language of each claque (A) side included in this
release.  The file does not include this information for the callee (B) side
of any call since it was not gathered during the collection of the corpus.

The contents of ./docs/cts_token-initial_diacritics.tab are explained in
section 5.3.1 (Stranded diacritic marks).

Other files explain the specifications and guidelines used by transcribers and
translators.

The directory also contains a partitions.tab file with information
about which files were designated as train, dev, test1, test2 or test3
for IWSLT22/23.

4.0 Audio Segmentation

Audio segmentation was carried out in three steps. First, the original
single-channel audio files were automatically segmented using speech
activity detection (SAD) at min_nonspch=0.5 seconds, and the resulting
output files for the corresponding A and B channels were merged and
ordered by onset time.  Then, segments with a duration of 15 seconds
or more were passed through SAD a second time, this time at
min_nonspch=0.3 seconds, to ensure that no segments were longer than
15 seconds.  Finally, the data was
manually corrected by experienced transcribers in XTrans to yield
transcription-ready segments.
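
The merge and length check in the first two steps can be sketched as
follows.  This is only an illustration, not the tooling used to build
the corpus: the SAD step itself is not shown, and the tuple layout
(onset, offset, side) and the function names are assumptions made for
the example.

```python
import heapq

MAX_SEGMENT_SEC = 15.0  # threshold stated in this README


def merge_by_onset(side_a, side_b):
    """Merge two onset-sorted segment lists into one list ordered by onset."""
    return list(heapq.merge(side_a, side_b, key=lambda seg: seg[0]))


def needs_second_pass(segments):
    """Return the segments lasting 15 seconds or more."""
    return [seg for seg in segments if seg[1] - seg[0] >= MAX_SEGMENT_SEC]


# Hypothetical SAD output for the two channels: (onset, offset, side)
side_a = [(0.0, 1.2, "A"), (3.0, 19.5, "A")]
side_b = [(1.5, 2.8, "B")]

merged = merge_by_onset(side_a, side_b)
long_segments = needs_second_pass(merged)
```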


5.0 Transcription

For CTS data, segments from a call were presented to transcribers via a web
interface one segment at a time and chronologically, alternating between
speaker A and speaker B, for the whole call.  The transcribers did not have
access to the full waveform, but they did have access to adjacent segments in
either of the two call sides if more context was needed.  The transcribers had
the option to reject a segment if it was bad (e.g. it wasn't Tunisian Arabic
speech, the boundaries were incorrect, etc.).

For CTS segments that were not rejected, transcribers provided an Arabic
orthographic transcript typed using Buckwalter transliteration.  For a
subset of the data, a broad phonemic IPA transcript was added.  There
was also a verification pass for each segment so that the transcriber
could confirm accuracy and mark individual tokens as MSA, foreign, or
uncertain.  For two-layer transcription, the verification pass also
ensured that the Arabic orthographic transcript and the IPA transcript
contained the same number of tokens.

5.1 Transcription and Translation File Format

This release includes one Arabic orthographic transcript file per
audio file, plus an associated English translation file.

For example, for the audio file:

 ./data/train/audio/20161227_230341_13205_A.flac

there are two corresponding text files:

 ./data/train/transcripts/20161227_230341_13205_A.tsv
 ./data/train/translations/20161227_230341_13205_A.tsv
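
As a sketch of this naming convention, the two text file paths can be
derived from an audio path.  The helper name companion_paths is
hypothetical, and the code assumes only the layout shown in section
2.0 (audio, transcripts and translations as sibling directories).

```python
from pathlib import PurePosixPath


def companion_paths(audio_path):
    """Return (transcript_path, translation_path) for one audio file."""
    audio = PurePosixPath(audio_path)
    split_dir = audio.parent.parent      # e.g. data/train
    tsv_name = audio.stem + ".tsv"       # same base name, .tsv extension
    return (
        str(split_dir / "transcripts" / tsv_name),
        str(split_dir / "translations" / tsv_name),
    )


transcript, translation = companion_paths(
    "data/train/audio/20161227_230341_13205_A.flac"
)
```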

Each text file contains four columns with the following information for each
segment:

(1) start time
(2) end time
(3) subject ID
(4) transcript text (in UTF-8 Arabic or English)
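
A minimal reader for this four-column layout might look like the
following.  It assumes that the files have no header row and that the
start and end times are decimal numbers; neither detail is stated
above, so check a file before relying on it.  The Segment type, the
function name and the sample rows (including the subject ID) are
made up for the example.

```python
import csv
import io
from typing import NamedTuple


class Segment(NamedTuple):
    start: float
    end: float
    subject_id: str
    text: str


def read_segments(stream):
    """Parse tab-delimited rows into Segment records."""
    # QUOTE_NONE keeps quotation marks in the text column as literal text.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    segments = []
    for row in reader:
        if len(row) != 4:
            continue  # skip malformed rows rather than guess at their layout
        start, end, subject_id, text = row
        segments.append(Segment(float(start), float(end), subject_id, text))
    return segments


# Hypothetical rows; real files would be opened with encoding="utf-8".
sample = io.StringIO("0.50\t2.25\t12345\thello there\n2.25\t3.00\t12345\t-\n")
segments = read_segments(sample)
```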

The CTS Arabic transcripts contain token-level markup if the transcriber
flagged the token as "MSA", "foreign", or "uncertain". The token flagging
involved prefixing one of the following to the affected tokens:

M/   MSA
O/   foreign ("other language")
U/   uncertain
UM/  uncertain + MSA
UO/  uncertain + foreign
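
One possible way to separate a flag prefix from its token is sketched
below.  The names FLAG_RE, FLAG_MEANING and split_flag are
hypothetical, and the sketch assumes that a flag is always attached
directly to the start of a whitespace-delimited token.

```python
import re

# Two-letter flags are listed first so that "UM/" is not read as "U/".
FLAG_RE = re.compile(r"^(UM|UO|M|O|U)/(.+)$")

FLAG_MEANING = {
    "M": "MSA",
    "O": "foreign",
    "U": "uncertain",
    "UM": "uncertain + MSA",
    "UO": "uncertain + foreign",
}


def split_flag(token):
    """Return (flag, bare_token); flag is None for unflagged tokens."""
    match = FLAG_RE.match(token)
    if match is None:
        return None, token
    return match.group(1), match.group(2)


# "\u0643\u062a\u0627\u0628" is an arbitrary Arabic-script example token.
flag, bare = split_flag("UM/\u0643\u062a\u0627\u0628")
```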

SPECIAL NOTE ABOUT DISPLAYING UTF-8 ARABIC TRANSCRIPT TEXT:

In some applications that support Arabic text display, the default algorithm
for bi-directional text (where a single line contains both right-to-left and
left-to-right characters) may cause the ordering of word tokens within a
segment to become scrambled when the various token flags and tag strings are
present.

In order to ensure that Arabic words in a segment are always presented in the
correct right-to-left sequence in all applications (text editors, terminal
emulators, etc.), each full-segment text string (column 4 of each row) should
be bracketed with Unicode directionality control characters, as follows:

   - prepend U+202B (RIGHT-TO-LEFT EMBEDDING) as the first character
   - append U+202C (POP DIRECTIONAL FORMATTING) as the last character

In the context of presenting data via HTML markup, there is also the
alternative method of including the attribute ' dir="rtl" ' on whatever
HTML tag is used to contain the Arabic segment.  (In this case, the tag
should hold only the segment and no additional, non-Arabic text.)
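
The bracketing with U+202B and U+202C described above can be sketched
as a small helper.  The function name wrap_rtl is hypothetical; the
two code points are the ones named in this section.

```python
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def wrap_rtl(text):
    """Bracket a segment string with RLE ... PDF for display purposes."""
    if text.startswith(RLE) and text.endswith(PDF):
        return text  # already bracketed; avoid nesting the controls
    return RLE + text + PDF


wrapped = wrap_rtl("O/taxi \u0643\u062a\u0627\u0628")
```

The control characters are meant for display only; they would
presumably need to be removed again before the text is tokenized or
compared against the original files.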

5.2 CTS Transcription Quality Control

After transcription, an additional quality control pass was conducted
on the corpus.  This quality control effort was focused on (1)
identifying and correcting common transcription errors that affected a
large number of tokens and (2) reducing the number of low-frequency
tokens that were clearly transcription errors.

The first type of transcription error was handled systematically by
carrying out global token or character replacements in the corpus.
Some common error types were also reviewed manually when necessary.
At this stage, care was taken to identify, manually review, and remove
any non-Buckwalter characters from the Buckwalter transcripts (with
the exception of some punctuation marks, e.g. "!", "?", etc.).

The second type of transcription error was addressed through a quality
control task made available to the transcription team.  All
unigrams with five or fewer occurrences were clustered by stripping
punctuation marks, diacritic letters, and suffixes and grouping the
resulting tokens that were identical.  These clusters were presented to
the transcribers along with the segment text for each token and a link
to the segment (with audio) in the transcription tool.  Transcribers
provided the correct token form for each cluster.  Where multiple
correct forms were needed for a single cluster, transcribers marked
which tokens were to be replaced with which correct token form.  The
token replacements were then applied within the specified segment.
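
The clustering step can be illustrated roughly as follows.  This is a
sketch under several assumptions, not the procedure actually used: it
works on Arabic-script tokens rather than Buckwalter, it treats
Unicode combining marks as the diacritics to strip, and it leaves out
suffix stripping because the suffix inventory is not given here.  All
names are hypothetical.

```python
import unicodedata
from collections import Counter, defaultdict


def cluster_key(token):
    """Strip combining marks (M*) and punctuation (P*) from a token."""
    return "".join(
        ch for ch in token
        if not unicodedata.category(ch).startswith(("M", "P"))
    )


def cluster_rare_tokens(tokens, max_count=5):
    """Group unigrams with max_count or fewer occurrences by cluster key."""
    counts = Counter(tokens)
    clusters = defaultdict(set)
    for token, n in counts.items():
        if n <= max_count:
            clusters[cluster_key(token)].add(token)
    return {key: sorted(members) for key, members in clusters.items()}


# Three spellings of one word plus one frequent token (six occurrences).
tokens = [
    "\u0628\u064e\u0627\u0628",   # with a fatha
    "\u0628\u0627\u0628",         # plain
    "\u0628\u0627\u0628!",        # with trailing punctuation
] + ["\u0641\u064a"] * 6
clusters = cluster_rare_tokens(tokens)
```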

5.3 Transcription - Known Issues

5.3.1 Stranded diacritic marks

The Arabic orthography transcripts contain 158 instances (in 134 transcript
files) of token-initial diacritic marks, sometimes co-occurring with a token
flag prefix (e.g. "O/").  In the Buckwalter files, the corresponding tokens
begin with one of the characters "a i o u ~".  These are residual
transcription errors: either an initial Buckwalter "a" should have been "A"
(or some form of "alef" with diacritic mark), or else an initial consonant
letter was omitted.  In the Arabic files, these initial diacritic marks will
"attach" to the preceding space or "/" character.  A list of the affected
transcript files, with the number of affected segments per file, is provided
in ./docs/cts_token-initial_diacritics.tab.
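
Affected tokens in the Arabic-script transcripts could be located with
a check like the one below.  It assumes that the stranded marks are
Unicode nonspacing marks (general category "Mn") and that any token
flag prefix has one of the forms listed in section 5.1; the function
names are hypothetical.

```python
import re
import unicodedata

FLAG_PREFIX = re.compile(r"^(?:UM|UO|M|O|U)/")


def has_initial_diacritic(token):
    """True if the token, minus any flag prefix, starts with a nonspacing mark."""
    body = FLAG_PREFIX.sub("", token, count=1)
    return bool(body) and unicodedata.category(body[0]) == "Mn"


def count_stranded(text):
    """Count whitespace-delimited tokens that begin with a diacritic mark."""
    return sum(has_initial_diacritic(tok) for tok in text.split())


# U+064E is ARABIC FATHA; the segment text is made up for the example.
segment = "\u0628\u064e\u0627\u0628 \u064e\u0628\u0627\u0628 O/\u064e\u0628"
n_stranded = count_stranded(segment)
```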

5.3.2 Some empty segments in CTS Arabic transcript files

In some CTS transcript files, one or more of the segments contain only a
hyphen character ("-") as the full content of the transcript field.
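
Users who want to skip these segments could filter them out as
sketched below.  The rows are assumed to be four-field tuples in the
column order given in section 5.1, and the sample values are made up.

```python
EMPTY_MARK = "-"


def drop_empty(rows):
    """Keep only rows whose transcript field is more than a lone hyphen."""
    return [row for row in rows if row[3].strip() != EMPTY_MARK]


rows = [
    ("0.50", "2.25", "12345", "hello there"),
    ("2.25", "3.00", "12345", "-"),
    ("3.00", "4.10", "12345", "well - maybe"),
]
kept = drop_empty(rows)
```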

6.0 Translation

For CTS translations, the full Arabic orthographic transcript from a call
(both A and B call sides) was presented as the source text to translators as
an interleaved conversation.  The translators were presented with speaker IDs
and timestamps for each segment to aid the readability of the conversation.
Translators did not have access to the audio files.  Transcription mark-up was
converted to standard translation-style mark-up in the data presented to
translators in order to promote consistent use of mark-up in the translations
and to increase readability and translation efficiency.

6.1 CTS Translation Mark-up

The following mark-up is used in the English translations:

      (()) - Uncertain word or words
      %pw - Partial word
      #  - Foreign word, followed either by its translation or by (())
            if it cannot be translated
      +  - Mispronounced word (carried over from mispronunciation
           marked in transcript)
      uh, um, eh or ah - Filled pauses
      =  - Typographical error from transcript

Detailed guidelines for translation can be found in
Tunisian_Arabic_CTS_Translation_Guidelines_V1.1.pdf in the docs/
directory.

6.2 CTS Translation Quality Control

All translations were automatically checked for completeness and consistent
use of mark-up. Segments whose expansion rate from Arabic into English
was markedly above or below average were flagged, manually reviewed, and
corrected by translators. Additional automated checks and normalizations
were applied to the data in this delivery.
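
An expansion-rate check of this general kind can be sketched as
follows.  The actual criteria used for this corpus are not documented
here, so the token-count ratio, the cutoff of k standard deviations
and all names are assumptions made for illustration.

```python
import statistics


def expansion_rate(source, target):
    """Ratio of target (English) tokens to source (Arabic) tokens."""
    n_source = max(len(source.split()), 1)  # guard against empty source text
    return len(target.split()) / n_source


def flag_outliers(pairs, k=2.0):
    """Return indices of pairs whose rate is more than k std devs from the mean."""
    rates = [expansion_rate(src, tgt) for src, tgt in pairs]
    mean = statistics.mean(rates)
    spread = statistics.pstdev(rates)
    return [i for i, rate in enumerate(rates) if abs(rate - mean) > k * spread]


# Ten placeholder pairs with a rate of 1.0 and one with a rate of 10.0.
pairs = [("a b", "x y")] * 10 + [("a", "x " * 10)]
flagged = flag_outliers(pairs)
```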

7.0 Acknowledgments

The authors acknowledge the following contributors to this data set:
Mohamed Maamouri (LDC)
Alexander Shelmire (LDC)
Jonathan Wright (LDC)
Joshua Parry (LDC)
Christopher Caruso (LDC)
Kevin Duh (JHU)

8.0 Copyright Information

(c) 2024 Trustees of the University of Pennsylvania

9.0 Contacts

For further information about this data release contact the following
project staff at LDC:

Stephanie Strassel <strassel@ldc.upenn.edu> - PI

----------------------
README created by Dana Delgado, David Graff, Stephanie Strassel
October 14, 2024