Title: BOLT Egyptian Arabic SMS/Chat Parallel Training Data

Authors: Jennifer Tracey, Dana Delgado, Song Chen, Stephanie Strassel

1.0 Introduction

This corpus consists of naturally-occurring Short Message Service (SMS)
and Chat (CHT) data in Egyptian Arabic, along with translations into
English. The collection and translation were performed as part of the
DARPA BOLT Program. Additional SMS/Chat source data collected under the
BOLT program can be found in LDC2017T07 BOLT Egyptian Arabic SMS/Chat
and Transliteration, LDC2018T15 BOLT Chinese SMS/Chat, and LDC2018T19
BOLT English SMS/Chat.

The DARPA BOLT (Broad Operational Language Translation) Program
developed genre-independent machine translation and information
retrieval systems. While earlier DARPA programs made significant strides
in improving natural language processing capabilities in structured
genres like newswire and broadcasts, BOLT was particularly concerned
with improving translation and information retrieval performance for
less-formal genres, with a special focus on user-contributed content.

LDC supported the BOLT Program by collecting informal data sources
including discussion forums, text messaging and chat in Chinese,
Egyptian Arabic and English. The collected data was translated and
richly annotated for a variety of tasks including word alignment,
Treebanking, PropBanking, and co-reference. LDC supported the evaluation
of BOLT technologies by post-editing machine translation system output
and assessing information retrieval system responses during annual
evaluations conducted by NIST.

This corpus comprises the Egyptian Arabic parallel training data created
for BOLT Phases 2 and 3. The corpus contains SMS and chat conversations
between two or more native Egyptian Arabic speakers. The bulk of the
data in this release consists of naturally-occurring, pre-existing SMS
or chat message archives donated by consenting Egyptian Arabic speakers.
Donated data is supplemented by new conversations among people who know
one another, collected by LDC specifically for BOLT using a custom
collection platform. All data was obtained with the informed consent of
the Egyptian Arabic speakers.

Collected and donated data was manually reviewed by LDC to exclude any
messages that were not in the target language, had potentially sensitive
content such as personal identifying information (PII), or contained
offensive content, and to select the richest possible data for
translation by excluding repetitive or vacuous content wherever
possible. However, due to the informal nature of SMS/chat messages and
the subjective nature of offensiveness, profanity or potentially
offensive content may be present in the data.

Collected conversations were then translated into English by
professional translators using guidelines similar to those used in
Phase 1 of BOLT for Discussion Forum data, with genre-specific
adaptations to address features found in SMS/Chat data.

The target of this collection was Egyptian Arabic, and conversations
that were not primarily in Egyptian Arabic were rejected during data
selection. However, given the informal nature of the data, a small
amount of mixing with English, French and/or other varieties of Arabic
is to be expected.

2.0 Directory structure

docs/README.txt - this file

data/
  For each translated source document there is a pair of files in the
  release:

    /source/__...su.xml
    and
    /target/__...su.xml

  where the fields of the file name are, in order:
    - genre: one of SMS and CHT
    - language: ARZ, which stands for Egyptian Arabic
    - date: YYYYMMDD date of the first message of a conversation
    - a four-digit identifier
    - language code: arz for files under /source/, eng for files
      under /target/

docs/
  filestats.tab - inventory of source files for Arabic, including an SU
                  count and token count for each file (for Arabic, a
                  token is a word).
  filelist.txt - inventory of files in this release
  BOLT_Arabic_translation_guidelines_v1.9.pdf - Arabic translation
                  guidelines

dtds/
  su-conversation.dtd - a dtd for .xml files

3.0 Data profile

The release includes both the Arabic source files and their English
translations. The following table shows the data volume of this package
(counts are on the source side):

+--------------+-------+-------+----------+---------------+
| source lang  | genre | files | messages | source tokens |
|--------------+-------+-------+----------+---------------|
| Arabic       | CHT   |  3562 |   136050 |        616649 |
| Arabic       | SMS   |   383 |    21519 |        106733 |
+--------------+-------+-------+----------+---------------+

Token counts are expressed in terms of words in Arabic.

3.1 Data selection and auditing

After collection, each conversation was audited by LDC to ensure
compliance with language requirements and to flag:

  - any sensitive personal identifying information (PII) or offensive
    content
  - messages not in the target language
  - messages that are duplicates
  - messages auto-generated by chat clients

Messages/conversations not in the target language or containing PII or
sensitive/offensive content were removed from the corpus. Messages that
are predominantly in the target language with occasional words in a
different language were retained. Messages consisting solely of
auto-generated mark-up, for example <media omitted>, were intended to be
excluded from translation and downstream annotation, but a few residual
cases not caught by the clean-up remain in the data.

3.2 Sentence segmentation

Messages in all conversations are arranged in chronological order. When
a message sent by a participant in either live collection or donated
data exceeded the character-length limit for SMS, the message could be
divided into two or more parts by the SMS service provider.
Depending on the combination of phone model, operating system, and
service provider, such divided messages could be marked with notations
such as (1/2) and (2/2) to indicate that they were sent as a single
message. For the purpose of downstream annotation and translation, such
messages were manually recomposed into a single message during the
auditing and sentence segmentation phase of processing. The entire
manually recomposed message is identified as a single SU (sentence
unit).

In addition, very long messages (exceeding 3-4 sentences of content)
were manually split into shorter units to enhance the quality of
translation and maintain good alignment between the source and
translation segments. Each part of a message that was manually split is
identified as a single SU.

4.0 Data format

All source and translation documents are in xml format, and each su
element contains information about the message(s) that contributed
content to the su.

The xml files have the following format. The medium value is either SMS
or CHT (chat), and the donated value is either true or false, where true
indicates the conversation is from a donated archive and false indicates
the conversation was collected via LDC's SMS and chat collection
platform.

Reserved characters such as "&" have been escaped using the standard
format (e.g., "&amp;"). Proper ingestion of the XML data requires an XML
parsing library.

The conversation_id is the file name minus the extension. Each message
has message id, subject id, and date attributes and contains a message
body.

For each xml file, the "" element of the "source" xml contains the
original Arabic text, while the "" element of the corresponding "target"
xml contains the English translation; both xmls also contain the
original text in their "message" elements.
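
The recommendation to use an XML parsing library can be illustrated with
a minimal sketch using Python's standard xml.etree.ElementTree module.
Only the element names "su" and "message" and the medium/donated values
are taken from this README; the root element name, the attribute names,
and the sample text are hypothetical and should be checked against
su-conversation.dtd before use.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the root element name, the attribute names and
# the message text are illustrative assumptions, not the actual schema.
SAMPLE = """<conversation id="EXAMPLE" medium="SMS" donated="true">
  <su id="s0">
    <message id="m0000">Tom &amp; Jerry</message>
  </su>
</conversation>"""


def summarize(xml_text):
    """Return the root attributes and, per su, the text of its messages."""
    root = ET.fromstring(xml_text)
    sus = []
    for su in root.iter("su"):
        # itertext() gathers all text below each message element, so the
        # sketch does not depend on the names of any child elements.
        messages = ["".join(m.itertext()).strip() for m in su.iter("message")]
        sus.append({"attrib": dict(su.attrib), "messages": messages})
    return dict(root.attrib), sus


root_attrib, sus = summarize(SAMPLE)
print(root_attrib)
print(sus)
```

The parser undoes the "&amp;" escaping, so the message text comes back
with a plain "&". For a file on disk, ET.parse(path).getroot() can
replace ET.fromstring(xml_text).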
For more information see docs/su-conversation.dtd

Note: 154 pairs of source and target xml files contain one or more
"presentation-form" Arabic characters.

5.0 Data processing

Data was originally collected in a variety of formats, due to
differences between donated and collected data. These formats were
normalized; the content of message bodies was not altered except to
convert from UTF-16 to UTF-8, replace carriage returns with newlines,
and remove apparently extraneous newlines and quotes from the periphery
of messages. Internal newlines may still occur when they are part of the
content entered by the message sender.

Dates were converted to UTC, and the various original means of
identifying participants were converted to LDC subject IDs. Participant
IDs are assigned consistently within each donated archive, but LDC did
not make any effort to normalize participant IDs across donated
archives, as such information is not consistently available in the
donations.

Message IDs were assigned, local to each conversation, starting at
m0000, based on the message order by date-time, which is also the order
in which messages are displayed in the output. Note that if a message
was deleted from a conversation during auditing, the message number
sequence reflects the deletion in that it has non-contiguous numbering.
For example, if a conversation originally contained 6 messages but the
third message was deleted during auditing because it contained PII, the
conversation xml will contain messages with IDs m0000, m0001, m0003,
m0004, and m0005. If participants deleted messages before uploading
their archive, LDC has no way of detecting this. Therefore,
conversations with message IDs whose numbering is continuous will not
necessarily have continuity of content.

Conversation IDs were assigned based on medium, language, and the date
of the first message.

Donated messages were extracted from various applications and devices.
These different sources use varying styles of newlines.
For simplicity and consistency, all newlines have been converted to use
the single-character, Unix-style line-feed, "\n".

6.0 Translation

Translation was done by professional translators from the original
source form, Arabizi (Latin script) or Arabic. Below are the translation
tags that we use to mark certain features in the target translation:

Translation:
  - translation alternatives    [intended meaning | literal meaning]
  - correction of typo          =text
  - best guess translation      ((text))

7.0 Known Issues

Some conversations included a range of emoticon characters whose Unicode
code-point values occupy the "Private Use Area" of the Unicode character
table. These characters have been left in place.

8.0 Sponsorship

This material is based upon work supported by the Defense Advanced
Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145.
The content does not necessarily reflect the position or the policy of
the Government, and no official endorsement should be inferred.

9.0 Contact Information

If you have any questions about the data in this release, please contact
the following personnel at the LDC.

  Jennifer Tracey    - BOLT SMS/Chat Translation Manager
  Dana Delgado       - BOLT SMS/Chat Translation Coordinator
  Stephanie Strassel - BOLT PI

-----------
README created by Dana Delgado on April 9, 2018
README updated by Dana Delgado on October 17, 2018