Title: BOLT Egyptian Arabic SMS/Chat and Transliteration

Authors: Zhiyi Song, Dana Fore, Stephanie Strassel, Haejoong Lee, Jonathan Wright

1. Introduction

This file contains documentation for the BOLT Egyptian Arabic SMS/Chat
and Transliteration Corpus. This corpus consists of naturally-occurring
Short Message Service (SMS) and Chat (CHT) data collected through data
donations and live collection involving native speakers of Egyptian
Arabic.

The DARPA BOLT (Broad Operational Language Translation) Program developed
genre-independent machine translation and information retrieval systems.
While earlier DARPA programs made significant strides in improving
natural language processing capabilities for structured genres like
newswire and broadcasts, BOLT was particularly concerned with improving
translation and information retrieval performance for less-formal genres,
with a special focus on user-contributed content. LDC supported the BOLT
Program by collecting informal data sources including discussion forums,
text messaging and chat in Chinese, Egyptian Arabic and English. The
collected data was translated and richly annotated for a variety of tasks
including word alignment, Treebanking, PropBanking and co-reference. LDC
supported the evaluation of BOLT technologies by post-editing machine
translation system output and assessing information retrieval system
responses during annual evaluations conducted by NIST.

This corpus comprises the Egyptian Arabic training data collected and
annotated for BOLT Phases 2 and 3. The corpus contains SMS and chat
conversations between two or more native Egyptian Arabic speakers. The
bulk of the data in this release consists of naturally-occurring,
pre-existing SMS or chat message archives donated by consented Egyptian
Arabic speakers. Donated data is supplemented by new conversations among
people who know one another, collected by LDC specifically for BOLT using
a custom collection platform.
All data was obtained with the informed consent of the Egyptian Arabic
speakers. Collected and donated data was manually reviewed by LDC to
exclude any messages that were not in the target language or that had
potentially sensitive content, such as personal identifying information
(PII).

The corpus contains 5,961 conversations totaling 1,029,248 words across
262,026 messages. Messages are natively written in either Arabic
orthography or romanized Arabizi. A total of 1,856 Arabizi conversations
(287,022 words) have been transliterated from the original romanized
Arabizi script into standard Arabic orthography. Section 4 below
describes the data collection, auditing and transliteration process in
detail.

2. Package Structure

README.txt - this file

data/ - directory containing data files
    source/          - source conversations
    transliteration/ - transliterated conversations

docs/ - directory containing package documents
    conversation0.2.1.dtd - a DTD for .conv.xml files in the data/source
        directory
    transliteration.dtd - a DTD for .transli.xml files in the
        data/transliteration directory
    BOLT_P2_Arabizi_transliteration_guidelines_v3.1.pdf - Arabizi to
        Arabic transliteration annotation guidelines
    gt24hrs.txt - list of conversations that contain a gap between
        messages larger than 24 hours
    arabic_sms_chat_source_collection.tab - source file list with
        message and word counts
    arabic_sms_chat_transliteration.tab - transliteration file list with
        SU and token counts

The file naming convention for xml files is by conversation ID, which is
<genre>_<language>_<date>_<id>, where:

    <genre>    is one of the genres: SMS or CHT
    <language> is ARZ, which stands for Egyptian Arabic
    <date>     is YYYYMMDD, the date of the first message of the
               conversation
    <id>       is a four-digit identifier

3. Contents

The tables below show the quantity of source and transliteration data by
genre:

Source:

+----------+-------+------------+----------+-------------+----------+
| language | genre | collection | num_conv | num_message | num_word |
+----------+-------+------------+----------+-------------+----------+
| arz      | sms   | donated    |      404 |      22,228 |   96,476 |
| arz      | sms   | live       |       42 |       2,121 |   19,946 |
| arz      | cht   | donated    |    5,515 |     237,677 |  912,826 |
+----------+-------+------------+----------+-------------+----------+

Transliteration:

+----------+-------+----------+--------+-----------+
| language | genre | num_conv | num_SU | num_token |
+----------+-------+----------+--------+-----------+
| arz      | sms   |      260 | 10,809 |    55,469 |
| arz      | cht   |    1,596 | 56,300 |   231,553 |
+----------+-------+----------+--------+-----------+

The files arabic_sms_chat_source_collection.tab and
arabic_sms_chat_transliteration.tab in the docs/ directory list the
inventory of documents with relevant quantities for each file.

4. BOLT SMS and Chat Collection Pipeline

The data in this release was collected using two methods: new collection
via LDC's collection platform, and donation of SMS or chat archives from
BOLT collection participants. All collected data was manually reviewed to
exclude any messages or conversations that were not in the target
language or that had sensitive content, such as personal identifying
information (PII).
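For reference, a conversation ID following the convention in Section 2
can be split into its fields mechanically. This is a minimal sketch; the
underscore separator and field order are inferred from the description
above, and the example ID is hypothetical:

```python
import re

# Fields inferred from Section 2: <genre>_<language>_<date>_<id>
CONV_ID_RE = re.compile(
    r"^(?P<genre>SMS|CHT)_(?P<language>ARZ)_(?P<date>\d{8})_(?P<id>\d{4})$"
)

def parse_conversation_id(conv_id):
    """Split a conversation ID into its four fields."""
    m = CONV_ID_RE.match(conv_id)
    if m is None:
        raise ValueError("not a valid conversation ID: %r" % conv_id)
    return m.groupdict()
```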
4.1 LDC's SMS and Chat Collection Platform

For text messaging (SMS) collection, LDC's collection platform initiated
each session by sending a text message to a pair of consented
participants, introducing them to one another and inviting them to begin
texting. The participants were native Egyptian Arabic speakers who were
typically known to one another but could be strangers. Participants
replied to the initiating message to start the conversation. The
collection platform relayed messages between the participants, so they
experienced normal SMS conversations. Relayed messages were stored in
LDC's database along with participant and conversation metadata.

For chat messaging collection, LDC's chat robot sent a message to each
participant pair inviting them to start a session. As with the SMS
collection, the participants were typically known to one another but
could be strangers. The participants carried on a discussion and the
robot captured the conversation. All conversations were stored in the
collection database along with participant and conversation metadata.

For both SMS and chat collections there was no suggested topic, and
participants were free to discuss any topic of their own choosing. For
SMS and chat data from live collection, a conversation was defined as
the messages between a pair of participants within a 24-hour time frame.

4.2 SMS and Chat Collection from Donations

Consented, native Egyptian Arabic speaking participants followed LDC's
instructions to create an archive of their SMS or chat data from their
phone or computer and upload the archive to LDC's collection site.
Participants had an opportunity to edit their archives prior to final
upload to exclude any data they did not want to donate. Participants
could delete entire messages and/or search their messages and redact
specific content, using a simple GUI developed by LDC. Redacted content
was replaced with "#", preserving a one-to-one character mapping.
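The length-preserving redaction described above can be illustrated as
follows. This is a sketch of the behavior, not LDC's actual tooling, and
the pattern shown is a hypothetical example:

```python
import re

def redact(text, pattern):
    """Replace each match of `pattern` with '#' characters of equal
    length, preserving the one-to-one character mapping."""
    return re.sub(pattern, lambda m: "#" * len(m.group()), text)
```

Because each matched character maps to exactly one "#", the redacted
message has the same length as the original.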
Post-processing of the uploaded archive included checking for
duplication, performing simple automated language ID, and dividing the
archive into conversations. An archive is first automatically divided
into groups of messages between particular sets of SMS/chat partners,
and those message groups are further subdivided into conversations every
time a chat partner takes more than 24 hours to respond.

For example, suppose an archive contains messages from Person A's phone.
It has conversations involving Persons A, B and C chatting, which we'll
call Group 1. Person A is chatting separately with Person D; that's
Group 2. In Group 2, Person D has for some reason not replied to a
message sent by Person A at 3pm yesterday until 7pm today - that's 28
hours between messages, so Group 2 will have two conversations: the
messages before 3pm yesterday, and those after 7pm today. In the end,
the archive from Person A may be divided into multiple conversations.

4.3 Auditing

After collection, each conversation was audited by LDC to ensure
compliance with language requirements and to flag:

- any sensitive personal identifying information (PII)
- messages not in the target language
- messages that are duplicates
- messages auto-generated by chat clients

Messages and conversations not in the target language or containing PII
or sensitive content were removed from the corpus. Messages that are
predominantly in the target language with occasional words in a
different language are retained. Messages consisting solely of
auto-generated mark-up, for example:

    <media omitted>

are retained in the source files. These were intended to be excluded
from transliteration and downstream annotation, but the cleanup was not
complete, so some remain.

4.4 Data Selection and Sentence Segmentation

Messages in all conversations are arranged in chronological order.
When a message sent by a participant in either live collection or
donated data exceeded the character-length limit for SMS, the message
could be divided into two or more parts by the SMS service provider.
Depending on the combination of phone model, operating system and
service provider, such divided messages could be marked with notations
such as (1/2) and (2/2) to indicate that they were sent as a single
message. For the purpose of downstream annotation such as
transliteration and translation, LDC manually recomposed such messages
into a single message in the source data for translation into English.
The entire manually recomposed message is identified as a single SU
(sentence unit).

In addition, very long messages (exceeding 3-4 sentences of content)
were manually split into shorter units in the source data for annotation
and translation. Each part of a message that was manually split is
identified as a single SU.

4.5 Automatic Transliteration from Romanized Arabic into Arabic Orthography

A portion of the source conversations containing Arabizi tokens were
automatically transliterated into Arabic script according to the
"Conventional Orthography for Dialectal Arabic" (CODA) (Habash et al.,
2012a), using a pre-release version of the 3ARRIB system developed at
Columbia University by Nizar Habash, Mohamed Al-Badrashiny, Ramy
Eskander and Owen Rambow. 3ARRIB (lit. "Arabize" in Arabic; pronounced
/ar.rib/) uses a 2K word list from the Qatar Computing Research
Institute that maps romanized words to Arabic script. The word list is
used to learn models for transforming letter sequences from romanized
Arabic to Arabic script. The CALIMA-ARZ morphological analyzer (Habash
et al., 2012b) is used as a filter to limit over-generation, and an
Arabic language model is used to select optimal in-context choices.
Details of 3ARRIB are described in Eskander et al. (2014).
An online demo of 3ARRIB can be seen at http://nlp.ldeo.columbia.edu/arrib/

The automatic transliteration was generated to expedite manual
transliteration and should not be used to compare against or judge
3ARRIB's quality. Hence the automatic transliteration is not included in
this package.

4.6 Annotation and Manual Correction of Transliteration

Once the Arabizi source was transliterated into Arabic script
automatically, LDC annotators reviewed, corrected and normalized the
transliteration according to CODA. Annotation and correction were
performed only on Arabizi tokens; SUs originally in Arabic script were
not annotated or corrected in this process. If a conversation or segment
contained both Arabizi and Arabic segments, only the Arabizi tokens were
affected by annotation, and the rest were left alone.

Annotators were presented with a source conversation in its original
form as well as the automatic transliteration output from 3ARRIB, when
applicable. The task was performed using a web-based tool.

In annotating the Arabizi source text, annotators flagged tokens that
are filled pauses or laughter, punctuation, Arabic proper names, or
foreign words (e.g. English or French). The flags used in annotation
are:

    sound:       filled pauses (such as "hmmm") and laughter (such as
                 "haha")
    name:        proper names, mostly person names
    foreign:     foreign words, such as English, French etc. Arabic
                 words that are not Egyptian dialect are not flagged.
    punctuation: standalone punctuation. Punctuation attached to tokens
                 is not flagged.

Emoticons are not flagged, but are converted to "#" in the corrected
transliteration. Annotation of the source was done token by token, so if
a token contained multiple words, each with different features, no flag
was applied to the token.
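These flags determine which tokens are subject to the correction pass
described in the next step: plain Arabizi tokens and tokens flagged
"name" are corrected, while "sound", "foreign" and "punctuation" tokens
are left alone. A minimal sketch, where the (token, flag) pair
representation is an assumption for illustration rather than the corpus
format:

```python
# Flags a token may carry; None stands for an ordinary Arabizi token.
CORRECTABLE_FLAGS = {None, "name"}

def correctable_tokens(annotated):
    """From (token, flag) pairs, keep only the tokens subject to
    correction: plain Arabizi tokens and those flagged 'name'. Tokens
    flagged 'sound', 'foreign' or 'punctuation' are excluded."""
    return [tok for tok, flag in annotated if flag in CORRECTABLE_FLAGS]
```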
In correction and normalization, annotators corrected the automatic
transliteration according to CODA and the intended meaning; edits
included spelling corrections and splitting or joining tokens as needed.
Correction and edits were not performed on source tokens flagged as
"sound", "foreign" or "punctuation". Tokens identified as Arabizi or
flagged as "name" were corrected and edited. Refer to section 5 below.

The annotation and transliteration guidelines are included in the
package under docs/.

5. Data Format

5.1 Source Data Conversation Format

The conv.xml files have the following format: the medium value is either
SMS or CHT (chat), and the donated value is either true or false, where
true indicates the conversation is from a donated archive and false
indicates the conversation was collected via LDC's SMS and chat
collection platform.

Reserved characters such as "&" have been escaped using the standard
format (e.g., "&amp;"). Proper ingesting of the XML data requires an XML
parsing library.

The conversation_id is the file name minus the extension. Each message
has message id, subject id and date attributes, and contains a message
body. For more information see docs/conversation0.2.1.dtd.

5.2 Transliteration Format

The transliteration documents are in transli.xml format. Each su element
contains information about the message(s) that contributed content to
the su.
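The token-aligned corrected transliteration in these files marks token
splits with [-] inside a token and joins with [+] attached to the first
token. Purely as an illustration of that notation (using placeholder
Latin tokens rather than real Arabizi), applying the markers could be
sketched as:

```python
def apply_markers(tokens):
    """Apply split/join markers to a token-aligned layer: '[-]' inside
    a token splits it at that spot; '[+]' at the end of a token joins
    it to the following token."""
    out = []
    join_next = False
    for tok in tokens:
        joins = tok.endswith("[+]")
        if joins:
            tok = tok[:-3]          # drop the join marker
        parts = tok.split("[-]")    # split at each split marker
        if join_next and out:
            out[-1] += parts[0]     # attach to the previous token
            out.extend(parts[1:])
        else:
            out.extend(parts)
        join_next = joins
    return out
```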
For conversations that contain Arabizi tokens, each su in the XML file
has six layers of content, tagged as follows in the mark-up:

- the original SU in Arabizi tokens
- the Arabizi tokens with annotations added
- the manually corrected transliteration, which matches the exact token
  count of the original Arabizi SU and the auto-transliterated SU;
  tokens that need to be split are marked with [-] at the spot where the
  split happens, and tokens that need to be joined have [+] attached to
  the first token
- the corrected transliteration in which tokens marked for join or split
  are joined or split accordingly
- the original message(s) from which the content was derived, included
  as a convenience

Note that conversations containing both Arabizi and Arabic tokens have
all six layers, regardless of whether a given segment is entirely in
Arabic or Arabizi. See docs/transliteration.dtd for more details.

6. Data Processing

Data was originally in a variety of formats, due to differences between
donated and collected data. These formats were normalized; the content
of message bodies was not altered except to convert from UTF-16 to
UTF-8, replace carriage returns with newlines, and remove apparently
extraneous newlines and quotes from the periphery of messages. Internal
newlines may still occur when they are part of the content entered by
the message sender.

Dates were converted to UTC, and the various original means of
identifying participants were converted to LDC subject IDs. Participant
IDs are assigned consistently within each donated archive, but LDC did
not attempt to normalize participant IDs across donated archives, as
such information is not consistently available in the donations.

Message IDs were assigned, local to each conversation, starting at
m0000, based on the message order by date-time, which is also the order
in which messages are displayed in the output.
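The ID assignment just described can be sketched as follows; the message
dict shape and ISO-style date strings are assumptions for illustration:

```python
def assign_message_ids(messages):
    """Assign conversation-local message IDs m0000, m0001, ... in
    date-time order, which is also the display order."""
    ordered = sorted(messages, key=lambda m: m["date"])
    for i, m in enumerate(ordered):
        m["id"] = "m%04d" % i
    return ordered
```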
Note that if a message is deleted from a conversation during auditing,
the message number sequence will reflect the deletion by having
non-contiguous numbering. For example, if a conversation originally
contained 6 messages but the third message was deleted during auditing
because it contained PII, the conversation xml will contain messages
with IDs m0000, m0001, m0003, m0004 and m0005. If participants deleted
messages before uploading their archive, LDC has no way of detecting
this; therefore, conversations with continuous message ID numbering will
not necessarily have continuity of content.

Conversation IDs were assigned based on medium, language, and the date
of the first message.

Donated messages were extracted from various applications and devices.
These different sources use varying styles of newlines. For simplicity
and consistency, all newlines have been converted to the
single-character, Unix-style line feed, "\n".

7. Known Issues

Some conversations include a range of emoticon characters whose Unicode
code-point values occupy the "Private Use Area" of the Unicode character
table. These characters have been left in place.

Some conversations contain a gap between messages greater than 24 hours
in duration. A list of these conversations may be found in
docs/gt24hrs.txt. There are two possible reasons behind this issue:

- Some conversations were donated and processed before the 24-hour rule
  was implemented.
- During auditing, some messages were flagged and hence excluded, which
  increased the gap between surrounding messages to more than 24 hours.

These files are left as-is, containing an over-long gap within the
message sequence, rather than being split into separate conversations.

8. Acknowledgements

This material is based upon work supported by the Defense Advanced
Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145.
The content does not necessarily reflect the position or the policy of
the Government, and no official endorsement should be inferred.

The authors acknowledge Kevin Walker, Jennifer Garland, Brian Gainor,
Preston Cabe, Thomas Thomas, Brendan Callahan, Stephen Grimes, David
Graff, Will Haun and Ann Sawyer for their help and support in collection
infrastructure, data processing, delivery preparation and documentation.
The authors also acknowledge Nizar Habash, Owen Rambow, Ramy Eskander
and Mohamed Al-Badrashiny for their support in developing the automated
procedures that made Arabizi transliteration more efficient.

9. References

Ann Bies, Zhiyi Song, Mohamed Maamouri, Stephen Grimes, Haejoong Lee,
Jonathan Wright, Stephanie Strassel, Nizar Habash, Ramy Eskander and
Owen Rambow. 2014. Transliteration of Arabizi into Arabic Orthography:
Developing a Parallel Annotated Arabizi-Arabic Script SMS/Chat Corpus.
EMNLP 2014: Conference on Empirical Methods in Natural Language
Processing, Doha, October 25-29.

Ramy Eskander, Mohamed Al-Badrashiny, Nizar Habash and Owen Rambow.
2014. Foreign Words and the Automatic Processing of Arabic Social Media
Text Written in Roman Script. In Arabic Natural Language Processing
Workshop, EMNLP, Doha, Qatar.

Nizar Habash, Mona Diab, and Owen Rambow. 2012a. Conventional
Orthography for Dialectal Arabic. In Proceedings of the Language
Resources and Evaluation Conference (LREC), Istanbul.

Nizar Habash, Ramy Eskander and Abdelati Hawwari. 2012b. A Morphological
Analyzer for Egyptian Arabic. In NAACL-HLT 2012 Workshop on
Computational Morphology and Phonology (SIGMORPHON 2012), pages 1-9.

Zhiyi Song, Stephanie Strassel, Haejoong Lee, Kevin Walker, Jonathan
Wright, Jennifer Garland, Dana Fore, Brian Gainor, Preston Cabe, Thomas
Thomas, Brendan Callahan and Ann Sawyer. 2014. Collecting Natural SMS
and Chat Conversations in Multiple Languages: The BOLT Phase 2 Corpus.
LREC 2014: 9th Edition of the Language Resources and Evaluation
Conference, Reykjavik, May 26-31.

10. Contact Information

Zhiyi Song          Collection Manager
Stephanie Strassel  BOLT PI
Dana Fore           Collection Coordinator
Jonathan Wright     Technical Manager

-----------

README Created by Zhiyi Song, September 2, 2014
Updated by Zhiyi Song, November 11, 2015
Updated by Zhiyi Song, December 10, 2015
Updated by Dave Graff, December 10, 2015
Updated by Zhiyi Song, December 11, 2015
Updated by Zhiyi Song, August 8, 2016
Updated by Zhiyi Song, November 1, 2016
Updated by Stephanie Strassel, November 1, 2016