NPS Internet Chatroom Conversations, Release 1.0 consists of 10,567 English posts (45,068 tokens) gathered from age-specific chat rooms of various online chat services in October and November 2006. Each file is a text recording from one of these chat rooms for a short period on a particular day. Users should be aware that some of the conversations in this corpus feature subjects and language that some people may find offensive or objectionable, including discussions of a sexual nature. This corpus was developed by researchers at the Department of Computer Science, Naval Postgraduate School, Monterey, California.
Although much work has been done in Natural Language Processing (NLP) on traditional written and spoken language domains, relatively little has been done on the newer computer-mediated communication (CMC) domains enabled by the Internet, such as text-based chat. One factor inhibiting research in this area has been the dearth of annotated CMC corpora available to the broader research community, despite the increasing use of CMC in a variety of areas and applications. NPS Internet Chatroom Conversations is one of the first text-based chat corpora tagged with lexical and discourse information. This corpus might be used to develop stochastic NLP applications that perform tasks such as conversation thread topic detection, author profiling, entity identification, and social network analysis.
Each post is annotated with a chat dialog-act tag, and individual tokens within each post are annotated with part-of-speech (POS) tags. An initial set of 3,507 tokenized posts was automatically tagged using a POS tagger trained on the Penn Treebank corpora, combined with a regular expression that identified privacy-masked user names and emoticons. Similarly, simple regular expression matching was used to assign an initial chat dialog-act tag to each post in this subset. This initial tagging was verified by hand, with corrections made where necessary. The remaining 7,060 posts were POS-tagged using a tagger trained on the newly hand-tagged chat data and the Penn Treebank corpora. Dialog-act tagging of these remaining posts was performed by a back-propagation neural network trained on 21 features of the initial dialog-act-labeled posts. The tagging of this second group was also manually verified and corrected where necessary. Ultimately, all 10,567 privacy-masked posts, representing 45,068 tokens, were annotated with manually verified part-of-speech and dialog-act information.
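The first tagging pass described above combines a general-purpose tagger with a regular expression for chat-specific tokens. A minimal sketch of that idea follows. It is not the annotators' actual code: both patterns, the function name override_tags, and the choice of NNP for user names and UH for emoticons are assumptions made for illustration.

```python
import re

# Privacy-masked user names of the form "<month>-<day>-<room>User<N>",
# e.g. "10-19-20sUser7". The exact pattern is an assumption.
USERNAME_RE = re.compile(r"^\d{1,2}-\d{1,2}-\w+?User\d+$")

# A deliberately small, hypothetical emoticon pattern:
# eyes, optional nose, one or more mouth characters.
EMOTICON_RE = re.compile(r"^[:;=8][-o']?[)(\]\[DPpOo/\\|]+$")


def override_tags(tagged_tokens):
    """Post-process (word, tag) pairs produced by some base POS tagger.

    Tokens that look like masked user names or emoticons get a fixed tag
    (assumed here: NNP and UH); every other token keeps the base tag.
    """
    result = []
    for word, tag in tagged_tokens:
        if USERNAME_RE.match(word):
            tag = "NNP"  # assumption: user names treated as proper nouns
        elif EMOTICON_RE.match(word):
            tag = "UH"   # assumption: emoticons treated as interjections
        result.append((word, tag))
    return result


# Pretend output of a base tagger for one short post.
base = [("10-19-20sUser7", "CD"), ("hi", "UH"), (":)", "NN"), ("there", "RB")]
corrected = override_tags(base)
```

In this sketch the regular expressions only overwrite the base tagger's output for the tokens they match, so a tagger trained on newswire-style text is left to handle the ordinary words.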
Filenames consist of date, target age group, and number of posts. For example, the file 10-19-20s_706posts.xml contains 706 posts gathered from the 20s chat room on October 19, 2006. The posts have been privacy-masked in two ways. First, all usernames have been changed to generic names of the form "UserN", where N is a unique integer consistently used for each respective poster across all files. The posts were then read by humans to remove other personally identifiable information. Within each file, usernames are prepended with the date and chat room portions of the filename. So in the above filename example, UserN becomes 10-19-20sUserN.
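The naming scheme above can be unpacked mechanically. The following sketch parses a filename into its parts and rebuilds the prefixed user name used inside that file. The helper names (ChatFileInfo, parse_filename, prefixed_username) are hypothetical, and the pattern assumes every file follows the same layout as the one example given here.

```python
import re
from typing import NamedTuple


class ChatFileInfo(NamedTuple):
    month: int
    day: int
    age_group: str
    num_posts: int


# Layout assumed from the single documented example:
# <month>-<day>-<age group>_<N>posts.xml
FILENAME_RE = re.compile(
    r"^(?P<month>\d{1,2})-(?P<day>\d{1,2})-(?P<group>[^_]+)_(?P<count>\d+)posts\.xml$"
)


def parse_filename(name: str) -> ChatFileInfo:
    """Split a corpus filename into date, age group, and post count."""
    m = FILENAME_RE.match(name)
    if m is None:
        raise ValueError(f"unexpected filename layout: {name!r}")
    return ChatFileInfo(
        int(m["month"]), int(m["day"]), m["group"], int(m["count"])
    )


def prefixed_username(name: str, user_number: int) -> str:
    """Build the in-file user name: date and room prefix plus 'UserN'."""
    prefix = name.split("_", 1)[0]  # e.g. "10-19-20s"
    return f"{prefix}User{user_number}"


info = parse_filename("10-19-20s_706posts.xml")
user = prefixed_username("10-19-20s_706posts.xml", 7)
```

For the example file, info holds month 10, day 19, age group "20s", and 706 posts, and user is "10-19-20sUser7". The year is not encoded in the filename; all files were collected in 2006.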
 Eric N. Forsyth and Craig H. Martell, "Lexical and Discourse Analysis of Online Chat Dialog," Proceedings of the First IEEE International Conference on Semantic Computing (ICSC 2007), pp. 19-26, September 2007.
 T. Wu, F. M. Khan, T. A. Fisher, L. A. Shuler and W. M. Pottenger, "Posting act tagging using transformation-based learning," Proceedings of the Workshop on Foundations of Data Mining and Discovery, IEEE International Conference on Data Mining, December 2002.
 A. Stolcke, K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, R. Martin, C. Van Ess-Dykema and M. Meteer, "Dialogue act modeling for automatic tagging and recognition of conversational speech," Computational Linguistics, vol. 26, no. 3, pp. 339-373, 2000.
 M. Zitzen and D. Stein, "Chat and conversation: a case of transmedial stability?" Linguistics, vol. 42, no. 5, pp. 983-1021, 2004.
Portions © 2010 Trustees of the University of Pennsylvania