English Web Treebank
Catalog ID: LDC2012T13
Release date: August 1, 2012
Linguistic Data Consortium
Authors: Ann Bies, Justin Mott, Colin Warner, Seth Kulick

1. Introduction

This release of the English Web Treebank consists of 254,830 word-level
tokens (16,624 sentence-level tokens) of web text in 1,174 files, annotated
for sentence- and word-level tokenization, part-of-speech, and syntactic
structure. The data is roughly evenly divided across five genres: weblogs,
newsgroups, email, reviews, and answers.

The data is consistent with the sentence-level tokenization outlined in
docs/WebtextTBAnnotationGuidelines.pdf, with the word-level tokenization
developed cross-site for English Treebanks under the DARPA GALE project,
and with the annotation guidelines for the English Translation Treebanks at
http://projects.ldc.upenn.edu/gale/task_specifications/EnglishXBank/.
Additional guidelines developed specifically to account for novel
constructions and other phenomena of the web text genre can be found in
docs/WebtextTBAnnotationGuidelines.pdf.

Google Inc. and the Linguistic Data Consortium collaborated on the creation
of this resource, which was funded through a gift from Google.

2. Annotation

2.1 Tasks and Guidelines

This release includes tokenization, part-of-speech, and syntactic treebank
annotation, following the tokenization and annotation guidelines listed in
Section 1.

"Sentence"-level tokens are separated by line breaks and introduced by
delimiters. "Word"-level tokens are separated by whitespace.

All brackets except "(" and ")" have been converted out of -LCB-/etc. form
into literal characters, and all HTML character codes have been converted
to their corresponding characters. Bracket representation is as follows:

  ( ) are represented as -LRB- and -RRB- in the .tree files (this includes
      emoticons such as :--RRB- )
  [ ] { } < > and all other brackets are unchanged

In addition, HTML character codes have been converted to their
corresponding characters at all levels of annotation:

  &quot;  "
  &gt;    >
  &lt;    <
  &amp;   &

A note on the sentence-level tokenization: the source data for previous
English Treebanks created at LDC was the output of either careful human
editing or translation. As such, line breaks in the source data provided a
reliable starting point for syntactic annotation, and an almost entirely
automated system was used to determine the boundaries between sentences
for those corpora. However, such deterministic methods of splitting
sentences are not wholly adequate for web text data. We have therefore
developed new guidelines for sentence-level tokenization for the English
Web Treebank, which are included in docs/WebtextTBAnnotationGuidelines.pdf.
The resulting files in the data/<genre>/source/source_text_ascii_tokenized
directories are the data that feed into the rest of the treebank annotation
pipeline.

2.2 Annotation Process

Text data was extracted from the original source files by script, taking
the linguistic text only from the title and the body, or the equivalent, in
each genre (i.e., not including other headers, markup, metadata, etc.).
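As a rough illustration of this extraction step, the sketch below pulls
title and body text from an SGML-style source file and decodes HTML
character codes as described in Section 2.1. It is a minimal sketch
assuming HEADLINE and POST field names (the names used in the selection
criteria in Section 3.1); it is not the actual LDC extraction script.

  # Illustrative sketch only; the field names and file layout are
  # assumptions, not the actual LDC extraction tooling.
  import html
  import re

  FIELDS = re.compile(r"<(HEADLINE|POST)>(.*?)</\1>", re.DOTALL)

  def extract_linguistic_text(source: str) -> str:
      """Keep only title/body text; drop headers, markup, and metadata."""
      parts = [m.group(2).strip() for m in FIELDS.finditer(source)]
      # HTML character codes (&quot; &gt; &lt; &amp;) become literal
      # characters, as described in Section 2.1.
      return html.unescape("\n".join(parts))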
Higher-level ASCII characters and non-ASCII characters in the text were
reduced to comparable common ASCII characters, as detailed in
docs/edit.list.

Tokenization by script was manually corrected to be consistent with the
tokenization guidelines developed for English Treebanks in the GALE project
and with docs/WebtextTBAnnotationGuidelines.pdf.

The tokenized data was run through an automatic POS tagger, and the tagger
output was manually corrected to be consistent with the English Treebank
part-of-speech annotation guidelines at
http://projects.ldc.upenn.edu/gale/task_specifications/EnglishXBank/ and in
docs/WebtextTBAnnotationGuidelines.pdf.

The corrected POS-annotated data was run through an automatic parser, and
the parser output was manually corrected to be consistent with the English
Treebank syntactic annotation guidelines at
http://projects.ldc.upenn.edu/gale/task_specifications/EnglishXBank/ and in
docs/WebtextTBAnnotationGuidelines.pdf.

The first QC process consisted of a series of specific searches for
approximately 200 types of potential inconsistency and parser or annotation
error. Any errors found in these searches were corrected by hand.

Lead annotators for this project were Justin Mott and Colin Warner. The
additional annotator was John Laury.

With this project we added a second QC process that identifies repeated
text and structures and flags non-matching annotations. Annotation errors
found in this way were manually corrected. This error detection system is
still being refined, but we are pleased to include it in the annotation
pipeline for the first time with this project. The following papers
describe the process:

Seth Kulick, Ann Bies, and Justin Mott. Using Derivation Trees for Treebank
Error Detection. ACL 2011, Portland, Oregon, USA, June 19-24, 2011.
http://papers.ldc.upenn.edu/ACL2011/DerivationTrees_TBErrorDetection.pdf

Seth Kulick and Ann Bies. A TAG-derived Database for Treebank Search and
Parser Analysis. TAG+10: 10th International Workshop on Tree Adjoining
Grammars and Related Formalisms, New Haven, CT, June 10-12, 2010.
http://papers.ldc.upenn.edu/TAG2010/tag-paper-correct.pdf

3. Source Data Profile

3.1 Data Selection Process

3.1.1 Weblog

The source data for Weblog was weblog data previously collected for other
projects by LDC. Data selection criteria for Weblog were as follows. Texts
were selected to be:

1. In the weblog domain
2. Taken from existing LDC data collections
3. Free of technical and formatting difficulties, to the extent possible
4. Appropriate and on-target content (i.e., not spam, not adult content)
5. The entire text of the post, excluding quoted material from previous
   posts
6. Linguistic text only (i.e., text from HEADLINE and POST fields, or the
   equivalent, not POSTER, POSTDATE, etc.)

3.1.2 Newsgroups

The source data for Newsgroups was newsgroup data previously collected for
other projects by LDC. Data selection criteria for Newsgroups were as
follows. Texts were selected to be:

1. In the newsgroup domain
2. Taken from existing LDC data collections
3. Free of technical and formatting difficulties, to the extent possible
4. Appropriate and on-target content (i.e., not spam, not adult content)
5. The entire text of the post, excluding quoted material from previous
   posts
6. Linguistic text only (i.e., text from HEADLINE and POST fields, or the
   equivalent, not POSTER, POSTDATE, etc.)
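As an aside, criterion 5 above (used for both the Weblog and Newsgroups
subcorpora) can be pictured with the following minimal sketch, which drops
lines quoted from previous posts under the common ">" quoting convention.
The actual selection was done by hand, and the quoting convention assumed
here is an illustration, not a documented property of the source data.

  # Hypothetical illustration of criterion 5; actual selection was manual.
  def drop_quoted_material(post_text: str) -> str:
      """Remove lines quoted from previous posts (">"-prefixed lines)."""
      kept = [line for line in post_text.splitlines()
              if not line.lstrip().startswith(">")]
      return "\n".join(kept)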
3.1.3 Email

The source data for this subcorpus was selected from the EnronSent Corpus,
a public domain corpus of email text prepared by Will Styler at the
University of Colorado at Boulder:

Styler, Will (2011). The EnronSent Corpus. Technical Report 01-2011,
University of Colorado at Boulder Institute of Cognitive Science, Boulder,
CO.

A description of the EnronSent Corpus from the website
http://verbs.colorado.edu/enronsent/ is as follows: "The EnronSent corpus
is a special preparation of a portion of the Enron Email Dataset designed
specifically for use in Corpus Linguistics and language analysis. Divided
across 45 plain text files, this corpus contains 2,205,910 lines and
13,810,266 words."

LDC hand-selected the data to be annotated for our project's subcorpus
using the following criteria:

1. Approximately 1000 words were taken from each data file in the Colorado
   set (starting at a randomly generated line number in each file, and
   manually taking c. 1000 words of appropriate text beginning with the
   first message that starts after that line number, and going forward),
   which yields a bit over 40K words total for annotation.
2. We used subject line and body text only (i.e., we did not include
   headers, etc., parallel to our WB and NG subcorpora data selection).
3. We avoided quoted text, forwarded text, and spam when possible (i.e., we
   hand-selected the text/messages that were not quoted, etc., also as with
   the WB and NG data selection), but some repeated text remains in this
   subcorpus. For this reason, we increased the total amount of data
   selected for this subcorpus, compared to the other web text subcorpora.

All of this data was already converted to ASCII as part of the University
of Colorado at Boulder public domain release. In the other web text
subcorpora, we reduced higher-level ASCII characters and non-ASCII
characters in the text to comparable common ASCII characters, as detailed
in docs/edit.list; this step was not necessary for the Email subcorpus.

3.1.4 Reviews

The source data for Reviews was supplied by Google from internal reviews
collections. Data selection criteria for Reviews were as follows.

1. Filtered for being sourced by Google's internal review collection
   service. These are local service reviews (e.g., hotels, restaurants,
   etc.).
2. Filtered for English. This is based on what the user enters or what
   Google predicts based on the user's locality. Individual reviews that
   were primarily non-English were removed from the annotated portion of
   the corpus.
3. Filtered to remove SSNs, credit card numbers, phone numbers, and IP
   addresses.
4. Only the title and body of each review were retained (no dates, user
   names, service category, merchant name, etc.).
5. Sampled randomly from the full set of 2.5M reviews, ensuring that each
   review has at least some body text (reviews that were empty, i.e., just
   a star rating, were thrown out).
6. Each review was assigned a sequential number (0-399999).
7. Reviews totaling approximately 50K words were selected for treebank
   annotation. This resulted in a total of 723 files (one review per file).
   A few were hand-selected, but more than 90% were randomly selected.

3.1.5 Answers

The source data for Answers was collected by LDC for this project. Using
Yahoo's API, we harvested Questions, which include both the Subject field
and the Content field, and Answers, which include the Chosen Answer (if an
answer was marked as "chosen" on Yahoo! Answers, which is not always the
case) and up to two additional Answers, for up to three answers total when
that many were present.
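As a hedged illustration of this harvesting logic, the sketch below keeps
the chosen answer (when one exists) plus additional answers, up to three in
total. The record layout and field names are assumptions made for the
example; they do not reproduce the Yahoo! API or the LDC harvesting
scripts.

  # Illustration only; the field names and record layout are assumed.
  def select_answers(answers: list[dict], limit: int = 3) -> list[dict]:
      """Keep the chosen answer first (if any), then fill up to `limit`."""
      chosen = [a for a in answers if a.get("is_chosen")]
      others = [a for a in answers if not a.get("is_chosen")]
      return (chosen + others)[:limit]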
These fields are marked with metadata tags in the originally harvested
files, the .xml files in data/answers/source/source_original. This metadata
is not included or annotated in the treebanked files.

Data selection criteria for the annotated data are as follows. Questions
and answers were selected for annotation. The question categories were
chosen both by the annotators and randomly. Question/answer pairs within
the categories were chosen randomly, and then further selected manually as
follows.

1. Relevance -- answers should be relevant to the questions.
2. English -- questions and answers entirely in a language other than
   English were eliminated from the corpus. Questions and answers with
   brief excursions into non-English were kept in the corpus if the
   non-English material was judged to be expected and acceptable foreign
   language usage within English.
3. Answers present -- questions in the corpus have at least one answer
   present, with the following three exceptions. These three files are
   included in the corpus, although they include only questions and not
   answers:

   20111106112223AAmR2im_ans.xml
   20111106193025AA2h62V_ans.xml
   20111107173321AASNJjc_ans.xml

   In all remaining files, up to three total answers were harvested, if
   that many answers were available.
4. Appropriate content -- questions and answers in the corpus were manually
   filtered for appropriate content.

3.2 Data Sources and Epochs

3.2.1 Weblog

The data consists of English weblog text from various sources dating from
2003 to 2006.

3.2.2 Newsgroups

The data consists of English newsgroup text from various sources dating
from 2003 to 2006.

3.2.3 Email

The data consists of English email text from the EnronSent Corpus, dating
approximately from the late 1990s and early 2000s.

3.2.4 Reviews

The data consists of English Google review text. (Dates may be unknown.)

3.2.5 Answers

The data consists of English Yahoo! Answers text. (Dates are unknown.)

4. Annotated Data Profile

The data consists of 1,174 files of web text, and a total of 254,830
word-level tokens.

5. Data Directory Structure

A listing of all of the files in this release can be found in
docs/file.tbl. A listing of the data filenames can be found in
docs/file.ids. See Section 6, File Format Description, below for details on
the various file formats. The data directory structure is as follows:

./data/<genre>/source/source_original -- the text from the original source
data or the original source files as collected

./data/<genre>/source/source_text_ascii_tokenized -- linguistic text only,
extracted from the subject and body fields (or the equivalent), manually
annotated with sentence- and word-level tokenization, with any higher-level
ASCII characters or non-ASCII characters changed into common ASCII
characters (details in docs/edit.list)

./data/<genre>/xml -- TreeEditor xml stand-off annotation files,
referencing the source_text_ascii_tokenized .txt files by character offset

./data/<genre>/penntree -- the annotation files in Penn Treebank bracketed
list style

6. File Format Description

6.1 *.txt (in data/<genre>/source/source_text_ascii_tokenized/)

Tokenized source files. These contain one sentence per line, with all token
boundaries represented by whitespace. Additionally, each line starts with a
sentence ID number of the form <en=1>. This is a formatting requirement for
our current version of the TreeEditor annotation tool.

Sample sentence from
blogspot.com_aggressivevoicedaily_20060629164800_ENG_20060629_164800:

<en=1> I 'll post highlights from the opinion and dissents when I 'm
finished .
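A minimal sketch of reading this format, assuming the <en=N> sentence-ID
convention shown above: split off the ID token, then treat the remainder of
the line as whitespace-separated word tokens. The function name and error
handling are illustrative only.

  # Minimal sketch, assuming the <en=N> sentence-ID convention above.
  import re

  SENT_ID = re.compile(r"<en=(\d+)>")

  def read_tokenized(path: str):
      """Return (sentence id, word tokens) pairs from a tokenized file."""
      sentences = []
      with open(path, encoding="ascii") as f:
          for line in f:
              id_token, _, rest = line.strip().partition(" ")
              match = SENT_ID.fullmatch(id_token)
              if match:  # skip anything that is not a well-formed line
                  sentences.append((int(match.group(1)), rest.split()))
      return sentences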
6.2 *.xml (in data/<genre>/xml/)

TreeEditor .xml stand-off annotation files. These files contain only the
POS and Treebank annotation, and reference the .txt files by character
offset. TreeEditor typically references the original source file rather
than the tokenization output, as we do here; we made this switch for these
corpora to aid in handling sentences that span multiple lines in the source
text.

6.3 *.tree (in data/<genre>/penntree/)

Bracketed tree files following the basic form (NODE (TAG token)). Each
sentence is surrounded by an outer pair of unlabeled parentheses. A minimal
sketch of reading these files appears at the end of this README. Sample:

( (S
    (NP-SBJ (PRP I))
    (VP (MD 'll)
      (VP (VB post)
        (NP
          (NP (NNS highlights))
          (PP (IN from)
            (NP (DT the) (NN opinion) (CC and) (NNS dissents))))
        (SBAR-TMP
          (WHADVP-9 (WRB when))
          (S
            (NP-SBJ (PRP I))
            (VP (VBP 'm)
              (ADJP-PRD (JJ finished))
              (ADVP-TMP-9 (-NONE- *T*)))))))
    (. .)) )

7. Data Validation

Care was taken to maintain the integrity of the data at each step of the
annotation pipeline, including the QC processes described in Section 2.2.

8. DTDs

The DTD files for the AG (Annotation Graph) format are kept in the same
directory as the .xml files: data/<genre>/xml/

9. Copyright Information

Portions (c) 2011, 2012 Trustees of the University of Pennsylvania

10. Contact Information

Contact info for key project personnel:

Ann Bies, Senior Research Coordinator, Linguistic Data Consortium,
bies@ldc.upenn.edu

11. Update Log

This index was updated on August 1, 2012 by Ann Bies.
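For reference, here is the sketch of reading the bracketed .tree files
described in Section 6.3. It is an illustrative s-expression reader, not
part of the released corpus tooling.

  # Illustrative reader for the bracketed .tree format in Section 6.3;
  # not part of the released corpus tooling.
  def parse_trees(text: str) -> list:
      """Parse bracketed trees into nested [label, child, ...] lists.

      A leaf like (PRP I) becomes ['PRP', 'I']; the unlabeled outer pair
      around each sentence becomes a node whose label is ''.
      """
      tokens = text.replace("(", " ( ").replace(")", " ) ").split()
      trees, stack = [], []
      for token in tokens:
          if token == "(":
              stack.append([])         # new node; label arrives next
          elif token == ")":
              node = stack.pop()
              if not node or not isinstance(node[0], str):
                  node.insert(0, "")   # unlabeled outer parentheses
              (stack[-1] if stack else trees).append(node)
          else:
              stack[-1].append(token)  # label or terminal token
      return trees

Applied to the sample in Section 6.3, parse_trees returns a single tree
whose S node contains the NP-SBJ, the VP, and the final punctuation.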