Title: BOLT Chinese SMS/Chat
Authors: Zhiyi Song, Dana Fore, Stephanie Strassel, Haejoong Lee,
         Jonathan Wright

1. Introduction

This file contains documentation for the BOLT Chinese SMS/Chat Corpus.
This corpus consists of naturally-occurring Short Message Service (SMS)
and Chat (CHT) data collected through data donations and live
collection involving native speakers of Chinese.

The DARPA BOLT (Broad Operational Language Translation) Program
developed genre-independent machine translation and information
retrieval systems. While earlier DARPA programs made significant
strides in improving natural language processing capabilities in
structured genres like newswire and broadcasts, BOLT was particularly
concerned with improving translation and information retrieval
performance for less-formal genres, with a special focus on
user-contributed content.

LDC supported the BOLT Program by collecting informal data sources
including discussion forums, text messaging and chat in Chinese,
Egyptian Arabic and English. The collected data was translated and
richly annotated for a variety of tasks including word alignment,
Treebanking, PropBanking, and co-reference. LDC supported the
evaluation of BOLT technologies by post-editing machine translation
system output and assessing information retrieval system responses
during annual evaluations conducted by NIST.

This corpus comprises the Chinese training data collected for BOLT
phases 2 and 3. The corpus contains SMS and chat conversations between
two or more native Chinese speakers. The bulk of the data in this
release consists of naturally-occurring, pre-existing SMS or chat
message archives donated by consenting Chinese speakers. Donated data
is supplemented by new conversations among people who know one another,
collected by LDC specifically for BOLT using a custom collection
platform. All data was obtained with the informed consent of the
Chinese speakers.
Collected and donated data was manually reviewed by LDC to exclude any
messages that were not in the target language or that had potentially
sensitive content, such as personal identifying information (PII).

The corpus contains 14,877 conversations totaling 3,005,810 words
across 497,543 messages. Section 4 below describes the data collection
and auditing process in detail.

2. Package structure

  README.txt - this file
  data/      - source conversations
  docs/      - directory containing package documents
    conversation0.2.1.dtd - a DTD for .conv.xml files in the
                            data/source directory
    gt24hrs.txt - list of conversations that contain a gap between
                  messages larger than 24 hours
    chinese_sms_chat_source_collection.tab - source file list with
                  message and word count

The filenaming convention for xml files is by conversation ID, which is

  <genre>_<lang>_<date>.<id>

where
  <genre> is one of the genres: SMS and CHT
  <lang>  is CMN, which stands for Chinese
  <date>  is the YYYYMMDD date of the first message of a conversation
  <id>    is a four-digit identifier

3. Contents

The table below shows the quantity of source collection by genre:

Source:
+----------+----------+------------+-------------+-------------+------------+
| language | genre    | collection | num_conv    | num_message | num_word   |
+----------+----------+------------+-------------+-------------+------------+
| cmn      | sms      | donated    |       1,809 |      19,889 |    159,564 |
+----------+----------+------------+-------------+-------------+------------+
| cmn      | sms      | live       |          36 |         752 |      4,788 |
+----------+----------+------------+-------------+-------------+------------+
| cmn      | cht      | donated    |      13,032 |     476,902 |  2,841,458 |
+----------+----------+------------+-------------+-------------+------------+
| total    | --       | --         |      14,877 |     497,543 |  3,005,810 |
+----------+----------+------------+-------------+-------------+------------+

Note: Word counts are calculated using a correspondence of 1.5 Chinese
characters per word.
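
The 1.5 characters-per-word correspondence can be illustrated with a
short Python sketch. It is not LDC's counting tool: the function name is
hypothetical, and it assumes that only code points in the CJK Unified
Ideographs block count as Chinese characters and that the result is
rounded to the nearest integer. This README specifies neither detail.

```python
def estimate_word_count(text, chars_per_word=1.5):
    """Estimate a word count at 1.5 Chinese characters per word.

    Assumption: a "Chinese character" is any code point in the CJK
    Unified Ideographs block (U+4E00..U+9FFF). Punctuation, digits and
    Latin letters are ignored. The corpus tooling may count differently.
    """
    num_chars = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return round(num_chars / chars_per_word)


# Five ideographs (the full-width comma is not counted): 5 / 1.5 -> 3
print(estimate_word_count("你好，明天见"))
```
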
The file chinese_sms_chat_source_collection.tab in the docs/ directory
lists each document with its message, word and Chinese character
counts.

4. BOLT SMS and Chat Collection Pipeline

The data in this release was collected using two methods: new
collection via LDC's collection platform, and donation of SMS or chat
archives from BOLT collection participants. All collected data was
reviewed manually to exclude any messages/conversations that were not
in the target language or that had sensitive content, such as personal
identifying information (PII).

4.1 LDC's SMS and Chat Collection Platform

For text messaging (SMS) collection, LDC's collection platform
initiated each session by sending a text message to a pair of
consenting participants, introducing them to one another and inviting
them to begin texting. The participants were native Chinese speakers
who were typically known to one another but could be strangers.
Participants replied to the initiating message to start the
conversation. The collection platform relayed messages between the
participants, so they experienced normal SMS conversations. Relayed
messages were stored in LDC's database along with participant and
conversation metadata.

For chat messaging collection, LDC's chat robot sent a message to each
participant pair inviting them to start a session. As with the SMS
collection, the participants were typically known to one another but
could be strangers. The participants carried on a discussion and the
robot captured the conversation. All conversations were stored in the
collection database along with participant and conversation metadata.

For both SMS and chat collections, there was no suggested topic and
participants were free to discuss any topic of their own choosing.

For SMS and chat data from live collection, a conversation was defined
as messages between a pair of participants within a 24-hour time frame.
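
The 24-hour rule can be sketched in Python. This is only one reading of
the definition: it assumes a conversation ends when the gap between
consecutive messages exceeds 24 hours, which is the rule section 4.2
describes for donated archives. LDC's platform may have applied the
window differently. The function name and the (timestamp, body) tuple
layout are hypothetical.

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(hours=24)


def split_conversations(messages, max_gap=MAX_GAP):
    """Split (timestamp, body) tuples into conversations.

    A new conversation starts whenever the time since the previous
    message exceeds max_gap. The input does not need to be sorted.
    """
    conversations = []
    previous_time = None
    for timestamp, body in sorted(messages, key=lambda m: m[0]):
        if previous_time is None or timestamp - previous_time > max_gap:
            conversations.append([])
        conversations[-1].append((timestamp, body))
        previous_time = timestamp
    return conversations


messages = [
    (datetime(2012, 3, 1, 14, 0), "first"),
    (datetime(2012, 3, 1, 15, 0), "second"),
    (datetime(2012, 3, 2, 19, 0), "third"),  # 28 hours after "second"
]
for conversation in split_conversations(messages):
    print([body for _, body in conversation])
```

With the sample data, the 28-hour silence separates "third" from the
first two messages, so two conversations are printed.
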

4.2 SMS and Chat Collection from Donations

Consenting native Chinese-speaking participants followed LDC's
instructions to create an archive of their SMS or chat data from their
phone or computer and upload the archive to LDC's collection site.
Participants had an opportunity to edit their archives prior to final
upload to exclude any data they didn't want to donate. Participants
could delete entire messages and/or search their messages and redact
specific content, using a simple GUI developed by LDC. Redacted content
was replaced with "#", preserving a one-to-one character mapping.

Post-processing of the uploaded archive included checking for
duplication, doing a simple automated language ID, and dividing the
archive into conversations. An archive was first automatically divided
into groups of messages between particular sets of SMS/chat partners,
and those message groups were further subdivided into conversations
every time a chat partner took more than 24 hours to respond.

For example: an archive contains messages from Person A's phone. It has
conversations among Persons A, B and C, which we'll call Group 1.
Person A is chatting separately with Person D; that's Group 2. In
Group 2, Person D has for some reason not replied to a message sent by
Person A at 3pm yesterday until 7pm today - that's 28 hours between
messages, so Group 2 will have two conversations: the messages up to
and including the one sent at 3pm yesterday, and those from 7pm today
onward. So in the end, the archive from Person A may be divided into
multiple conversations.

4.3 Auditing

After collection, each conversation was audited by LDC to ensure
compliance with language requirements and to flag:

  - any sensitive personal identifying information (PII)
  - messages not in the target language
  - messages that are duplicates
  - messages auto-generated by chat clients

Messages/conversations not in the target language or containing PII or
sensitive content were removed from the corpus.
Messages that are predominantly in the target language with occasional
words in a different language are retained. Messages consisting solely
of auto-generated mark-up are retained in the source files. For
example: <media omitted>

5. Data Format

5.1 Source Data Conversation Format

The conv.xml files have the following format:

Medium value is either SMS or CHT (chat), and donated value is either
true or false, where true indicates the conversation is from a donated
archive and false indicates the conversation was collected via LDC's
SMS and chat collection platform.

Reserved characters such as "&" have been escaped using the standard
format (e.g., "&amp;"). Properly ingesting the XML data requires an XML
parsing library.

The conversation_id is the file name minus the extension. Each message
has message id, subject id, and date attributes and contains a message
body.

For more information see docs/conversation0.2.1.dtd.

6. Data Processing

Data was originally in a variety of formats, due to differences between
donated and collected data. These formats were normalized; the content
of message bodies was not altered except to convert from UTF-16 to
UTF-8, replace carriage returns with newlines, and remove apparently
extraneous newlines and quotes from the periphery of messages. Internal
newlines may still occur when they are part of the content entered by
the message sender.

Dates were converted to UTC, and the various original means of
identifying participants were converted to LDC subject IDs. Participant
IDs are assigned consistently within each donated archive, but LDC did
not make any effort to normalize participant IDs across donated
archives, as such information is not consistently available in the
donations.

Message IDs were assigned, local to each conversation, starting at
m0000, based on the message order by date-time, which is also the order
in which messages are displayed in the output.
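
The numbering scheme can be sketched in Python as follows. This is an
illustration rather than LDC's processing code: the function name and
the (timestamp, body) tuple layout are hypothetical, and the four-digit
zero padding is inferred from the "m0000" pattern.

```python
from datetime import datetime


def assign_message_ids(messages):
    """Return (message_id, timestamp, body) tuples in date-time order.

    IDs are local to one conversation and start at m0000. Messages with
    identical timestamps keep their input order (sorted() is stable).
    """
    ordered = sorted(messages, key=lambda m: m[0])
    return [
        (f"m{index:04d}", timestamp, body)
        for index, (timestamp, body) in enumerate(ordered)
    ]


conversation = [
    (datetime(2012, 3, 1, 15, 5), "reply"),
    (datetime(2012, 3, 1, 15, 0), "hello"),
]
for message_id, timestamp, body in assign_message_ids(conversation):
    print(message_id, timestamp.isoformat(), body)
```
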
Note that if a message is deleted from a conversation during auditing,
the message number sequence will reflect the deletion in that it will
have non-contiguous numbering. For example, if a conversation
originally contained 6 messages but the third message is deleted during
auditing because it contains PII, the conversation xml will contain
messages with IDs m0000, m0001, m0003, m0004, and m0005.

If participants delete certain messages before uploading their archive,
LDC has no way of detecting this. Therefore, conversations with message
IDs whose numbering is continuous will not necessarily have continuity
of content.

Conversation IDs were assigned based on medium, language, and the date
of the first message.

Donated messages were extracted from various applications and devices.
These different sources use varying styles of newlines. For simplicity
and consistency, all newlines have been converted to the
single-character, Unix-style line feed, "\n".

7. Known Issues

Some conversations included a range of emoticon characters whose
Unicode code-point values occupy the "Private Use Area" of the Unicode
character table. These characters have been left in place.

Some conversations contain a gap between messages greater than 24 hours
in duration. A list of these conversations may be found in
docs/gap24hrs.tab. There are two possible reasons behind the issue:

  - Some conversations were donated and processed before the 24-hour
    rule was implemented.
  - During auditing, some messages were flagged and hence excluded,
    which then increased the gap between surrounding messages to more
    than 24 hours.

These files are being left as-is, containing an over-long gap within
the message sequence, rather than being split into separate
conversations.

8. Acknowledgements

This material is based upon work supported by the Defense Advanced
Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145.
The content does not necessarily reflect the position or the policy of
the Government, and no official endorsement should be inferred.

The authors acknowledge Kevin Walker, Jennifer Garland, Brian Gainor,
Preston Cabe, Thomas Thomas, Brendan Callahan, Stephen Grimes, David
Graff, Will Haun and Ann Sawyer for their help and support in
collection infrastructure, data processing, delivery preparation and
documentation.

9. References

Zhiyi Song, Stephanie Strassel, Haejoong Lee, Kevin Walker, Jonathan
Wright, Jennifer Garland, Dana Fore, Brian Gainor, Preston Cabe, Thomas
Thomas, Brendan Callahan, Ann Sawyer. 2014. Collecting Natural SMS and
Chat Conversations in Multiple Languages: The BOLT Phase 2 Corpus.
LREC 2014: 9th Edition of the Language Resources and Evaluation
Conference, Reykjavik, May 26-31.

10. Contact Information

  Zhiyi Song          Collection Manager
  Stephanie Strassel  BOLT PI
  Dana Fore           Collection Coordinator
  Jonathan Wright     Technical Manager

-----------
README Created by Zhiyi Song, September 2, 2014
       Updated by Zhiyi Song, November 11, 2015
       Updated by Zhiyi Song, December 10, 2015
       Updated by Dave Graff, December 10, 2015
       Updated by Zhiyi Song, December 11, 2015
       Updated by Zhiyi Song, August 8, 2016
       Updated by Zhiyi Song, November 1, 2016
       Updated by Stephanie Strassel, November 1, 2016
       Updated by Zhiyi Song, February 27, 2018