BBN/LDC/Sakhr Arabic-Dialect/English Parallel Corpus

1.0 Introduction

This corpus consists of Arabic dialect sentences and their English
translations. The data was obtained in two ways: (A) from automatic
filtering of LDC Arabic Web text corpora, and (B) from Web harvesting
by Sakhr Software. Dialect classification, sentence segmentation, and
translation to English were then done using Amazon's crowdsourcing
service, Mechanical Turk.

2.0 Directory Content

README
    This file.

BBN-Dialect_Arabic-English-Web.xml.gz
    Gzipped XML file containing segment sources, their translations,
    and additional information.

BBN-Dialect_Arabic-English-Web.turker_ratings.txt
    Ratings of the quality of Mechanical Turk worker translations, on
    a 1-10 scale. Ratings are based on human judgment of a subset of
    worker outputs. Ratings were obtained for 64 of the 173 workers,
    covering 77% of the corpus (by token count).

3.0 Data Profile

Note: In Table 1 below, "Source Segs" and "Source Toks" count unique
source segments and source tokens. "Translated Src Segs" and
"Translated Src Toks" count the total number of segments and tokens
translated, accounting for multiple translations.

+-----------+-------------+---------------------+-------------+---------------------+--------------+
| Dialect   | Source Segs | Translated Src Segs | Source Toks | Translated Src Toks | English Toks |
+-----------+-------------+---------------------+-------------+---------------------+--------------+
| EGYPTIAN  |      36,036 |              38,154 |     345,399 |             361,375 |      511,311 |
| LEVANTINE |     125,890 |             138,010 |   1,045,684 |           1,132,923 |    1,559,071 |
+-----------+-------------+---------------------+-------------+---------------------+--------------+
| Total     |     161,926 |             176,164 |   1,391,083 |           1,494,298 |    2,070,382 |
+-----------+-------------+---------------------+-------------+---------------------+--------------+

Table 1: Number of segments and tokens in the BBN/LDC/Sakhr
Arabic-Dialect/English parallel corpus.
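The worker ratings file can be used to filter translations by
translator quality. A minimal loading sketch follows; it assumes a
simple two-column layout (worker ID, then a 1-10 rating), which
should be verified against the actual file before use:

```python
def load_turker_ratings(path):
    """Load worker ratings into a dict of {turker_id: rating}.

    Assumes a whitespace-separated two-column layout (worker ID,
    rating); header or malformed lines are skipped. Check the actual
    ratings file layout before relying on this.
    """
    ratings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            try:
                ratings[parts[0]] = float(parts[1])
            except ValueError:
                continue  # skip header/comment lines
    return ratings
```

A downstream user could then keep only segments whose TURKER_ID maps
to a rating above some chosen threshold.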
4.0 Data Format

The XML data file has the following format, with one "SEGMENT"
element per translated segment:

  <SEGMENT>
    <DIALECT>EGYPTIAN</DIALECT>
    <GUID>...</GUID>
    <SOURCE_GUID>...</SOURCE_GUID>
    <SEG_NUM>...</SEG_NUM>
    <SOURCE>...</SOURCE>
    <TARGET>...</TARGET>
    <TURKER_ID>...</TURKER_ID>
  </SEGMENT>
  ...

The "SEGMENT" element contains the source and translation of a
segment (i.e., a segmented sentence), along with related information.

The "DIALECT" element contains the dialect label for the segment.

The "GUID" field contains a unique identifier of the translated
segment, formed by appending the translator ID to the "SOURCE_GUID".
The GUID field has the following format:
"[id1][id2][segment_index][turker_id]". The substring "[id1][id2]"
identifies the passage that the current segment belongs to.
"[segment_index]" is the 0-based index of this segment in the
passage. "[turker_id]" is the unique identifier of the Mechanical
Turk worker that translated this segment.

The "SOURCE_GUID" field is a unique identifier of the source segment.
It is a substring of the "GUID" field.

The "SEG_NUM" field contains the 0-based index of this segment in the
passage.

The "SOURCE" field contains the original source text of the segment.

The "TARGET" field contains the raw (untokenized) translation of the
segment.

The "TURKER_ID" field contains a unique, anonymous identifier of the
Mechanical Turk worker that translated the segment.

5.0 Data Collection

The data in this corpus originated from two sources:

Part A. Filtered automatically from large Arabic text corpora
harvested from the web by LDC. The original filtering was applied to
the following LDC corpora: LDC2006E32, LDC2006E77, LDC2006E90,
LDC2007E04, LDC2007E44, LDC2007E102, LDC2008E41, LDC2008E54,
LDC2009E14, LDC2009E93. These consist largely of weblogs and online
user groups, and amount to around 350 million Arabic words. Documents
that contain a large percentage of non-Arabic or MSA words were
eliminated. Then a list of dialect words was manually selected by
culling through the Levantine Fisher and Egyptian CallHome speech
corpora. The list was then used to retain documents that contain a
certain number of matches.
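The dialect-word filtering step can be sketched as a simple match
count against the manually selected word list. The threshold below is
illustrative; the corpus notes only that documents needed "a certain
number of matches":

```python
def passes_dialect_filter(tokens, dialect_words, min_matches=3):
    """Return True if a document contains enough dialect-word hits.

    `dialect_words` is a set of manually selected dialectal tokens;
    `min_matches` is an illustrative threshold, not the value actually
    used in building the corpus.
    """
    hits = sum(1 for tok in tokens if tok in dialect_words)
    return hits >= min_matches
```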
The resulting subset of the web corpora contained around four million
words. Documents were automatically segmented into passages using
formatting information from the raw data. In Table 1, 37% of Egyptian
tokens and 61% of Levantine tokens were obtained this way.

Part B. Manually harvested by Sakhr Software from dialect web sites.
In Table 1, 63% of Egyptian tokens and 39% of Levantine tokens were
obtained this way.

6.0 Data Processing

Dialect classification, sentence segmentation (as needed), and
translation to English were all performed by BBN through Amazon's
Mechanical Turk. Dialect classification and sentence segmentation for
Part A of the corpus were performed through Mechanical Turk, as
described below. Because the data collection for Part B targeted
specific dialects and was segmented manually at the time of
collection, it did not require any further dialect classification or
segmentation. The translation of the whole Arabic dialect corpus into
English was performed using Mechanical Turk. Details on each of the
Mechanical Turk tasks follow.

Dialect Classification
----------------------

Arabic annotators from Mechanical Turk were hired to classify the
filtered passages as either Modern Standard Arabic (MSA) or one of
four regional dialects: Egyptian, Levantine, Gulf/Iraqi, or Maghrebi.
An additional "General" dialect option was allowed for ambiguous
passages. The classification was applied to whole passages, rather
than to individual sentences.

Quality control was done by using a set of passages for which the
dialect labels were known. Control passages were presented 20% of the
time, and poor-performing workers were eliminated regularly.
Initially, three classifications from different workers were
performed for each passage, and dialect labels were accepted only if
at least two of them agreed. The number of classifications was
subsequently reduced to two per passage. The passage rejection rate
was 2%.
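The agreement rule described above can be sketched as a simple
majority vote over the per-passage labels (initially 2-of-3, later
2-of-2):

```python
from collections import Counter

def consensus_label(labels, min_agreement=2):
    """Return the dialect label if at least `min_agreement` workers
    agree, else None (no label accepted for the passage).

    A sketch of the scheme described above; the actual worker
    elimination and rejection mechanics are not reproduced here.
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agreement else None
```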
Only the passages labeled Levantine and Egyptian were further
processed.

Sentence Segmentation
---------------------

Since the data filtered from the LDC web corpora mostly consisted of
user-generated informal web content, the existing punctuation was
often insufficient to determine sentence boundaries. Documents were
automatically segmented into passages using formatting information
from the raw data. Passages were then segmented into individual
sentences using Mechanical Turk, before being translated. Only
passages longer than 15 words were required to be segmented.
Mechanical Turk workers were allowed to split and rejoin at any point
between the tokens. A set of correctly segmented passages was also
used for quality control, and the workers were scored using a metric
based on the precision/recall of the correct segmentation points. The
passage rejection rate was 1.2%.

Translation to English
----------------------

The segmented Levantine and Egyptian sentences were then translated
using Mechanical Turk. The workers were instructed to translate
completely and accurately, and to transliterate Arabic names. They
were also provided with examples. All segments of a passage were
presented in the same translation task to give the translator enough
context.

Several quality control measures were employed. The Turkers were
prevented from simply cutting and pasting the Arabic text into
translation software by rendering the Arabic sentences as images
rather than text, and by spot-checking the translations against the
output of Google Translate and Bing Translator. Garbage input was
detected by counting the percentage of words not found in a
predefined English lexicon. The quality of individual translators was
quantified in two ways: first, by asking judges who are native
speakers of Arabic to score a sample of each worker's translations;
and second, by inserting control sentences for which we have good
reference translations and measuring the workers' METEOR and BLEU-1
scores.
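As an illustration of the control-sentence scoring, sentence-level
BLEU-1 reduces to clipped unigram precision times a brevity penalty.
This is a simplified sketch, not the actual evaluation code, which
would have used standard METEOR and BLEU implementations:

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """Sentence-level BLEU-1 against a single reference: clipped
    unigram precision multiplied by the brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    overlap = Counter(cand) & Counter(ref)  # clip counts by reference
    precision = sum(overlap.values()) / len(cand)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision
```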
The rejection rate of translation assignments was 5%. A total of 173
translators worked on the translations; 121 of them translated 20 or
more passages each.
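As a usage sketch, the gzipped XML file can be streamed without
loading it all into memory. The element names below are those listed
in Section 4.0; the enclosing wrapper element, if any, is not
specified there, so verify the layout against the actual file:

```python
import gzip
import xml.etree.ElementTree as ET

def iter_segments(path):
    """Stream SEGMENT records from the gzipped XML file.

    Yields one dict per SEGMENT, keyed by the child element names from
    Section 4.0 (DIALECT, GUID, SOURCE_GUID, SEG_NUM, SOURCE, TARGET,
    TURKER_ID). Assumes well-formed XML with one SEGMENT element per
    translated segment.
    """
    with gzip.open(path, "rb") as f:
        for _event, elem in ET.iterparse(f):
            if elem.tag == "SEGMENT":
                yield {child.tag: (child.text or "") for child in elem}
                elem.clear()  # free memory while streaming large files
```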