BOLT English Treebank - Discussion Forum

CatalogID: LDC2019T15
Release date: October 15, 2019
Linguistic Data Consortium
Authors: Ann Bies, Justin Mott, Colin Warner, Seth Kulick

1. Introduction

This corpus of English Treebank consists of 268,907 tokens/words in 702 files of discussion forum text from various sources annotated for part-of-speech and syntactic structure.

This data was previously released as subcorpora in earlier versions to the BOLT community; this publication consolidates the Phase 1 English Treebank discussion forum (DF) data. The corpora that were released to the BOLT community previously had the following catalog numbers:

    LDC2012E92  (DF Part 1)
    LDC2012E97  (DF Part 2)
    LDC2012E114 (DF Part 3)
    LDC2013E17  (DF Part 4)
    LDC2013E40  (DF Part 5)

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

2. Annotation

2.1 Tasks and Guidelines

The guidelines followed for both part-of-speech and treebank annotation are essentially Penn Treebank II style, with the changes to those guidelines that were developed under the GALE (Global Autonomous Language Exploitation) program, primarily in the tokenization of hyphenated words, part-of-speech and tree changes necessitated by those tokenization changes, and updates to the syntactic annotation to comply with the most updated annotation guidelines (including the "Treebank-PropBank merge" or "Treebank IIa" and "treebank c" changes).

The original Penn Treebank II guidelines can be found at
https://repository.upenn.edu/cis_reports/1021/

Addenda detailing the more recent changes can be found at
docs/etb-supplementary-guidelines-2009-addendum.pdf and
docs/etb-webtext-guidelines.pdf.

All co-reference indices are shown on the syntactic node label, including reference indices on the node labels for the empty categories (as in all LDC English and Arabic Treebank releases).

2.2 Annotation Process

The English text to be treebanked was extracted from files of English Discussion Forum that had already undergone SU annotation at LDC. Higher level ASCII characters or non-ASCII characters in the text were reduced to comparable common ASCII characters, as detailed in docs/edit.list.

Word-level tokenization by script was manually corrected to be consistent with tokenization guidelines developed for English Treebanks in the GALE project. SU-level tokenization, however, was inherited from the SU annotation, and was not adjusted during the treebank process (with the exception of multiple sentences contained in a single SU, which were necessarily treebanked as separate sentences).

The tokenized data was run through an automatic POS tagger, and the tagger output was manually corrected to be consistent with the English Treebank part-of-speech annotation guidelines in https://repository.upenn.edu/cis_reports/570/ and addenda listed in Section 2.1.

The corrected POS annotated data was run through an automatic parser, and the parser output was manually corrected to be consistent with the English Treebank syntactic annotation guidelines https://repository.upenn.edu/cis_reports/1021/ and addenda listed in Section 2.1.

The first QC process consisted of a series of specific searches for approximately 200 types of potential inconsistency and parser or annotation error. Any errors found in these searches were hand corrected.

We also included an additional QC process that identifies repeated text and structures, and flags non-matching annotations. Annotation errors found in this way have been manually corrected. This error detection system was a work in progress and in the process of refinement at the time of this annotation.

The following papers describe this process:

    Seth Kulick, Ann Bies, and Justin Mott
    Using Derivation Trees for Treebank Error Detection
    ACL 2011, Portland, Oregon, USA, June 19-24, 2011
    http://papers.ldc.upenn.edu/ACL2011/DerivationTrees_TBErrorDetection.pdf

    Seth Kulick and Ann Bies
    A TAG-derived Database for Treebank Search and Parser Analysis
    TAG+10: 10th International Workshop on Tree Adjoining Grammars and Related Formalisms, New Haven, CT, June 10-12, 2010
    http://papers.ldc.upenn.edu/TAG2010/tag-paper-correct.pdf

Lead annotators for this project were Justin Mott and Colin Warner. Additional annotators were John Laury, Myke Eggers, Jonathan Gress-Wright, and Arrick Lanfranchi.

3. Source Data Profile

3.1 Data Selection Process

Data was selected for treebank annotation from files of English Discussion Forum that had already undergone SU annotation at LDC.

3.2 Data Sources and Epochs

The data consists of English source data discussion forum text from various sources collected by LDC in 2011 and 2012.

4. Annotated Data Profile

This data consists of 702 files of discussion forum text from various sources and a total of 268,907 tokens, all of which have been annotated for word-level tokenization, part-of-speech and syntactic structure.

The source files in data/source_ascii-engtext consist of only the English text that is being treebanked. All annotation content is contained within metadata tags. Any non-ASCII characters in the original .tdf version of these files have been replaced with low level ASCII characters according to docs/edit.list.

5. Data Directory Structure

A listing of all of the files in this release can be found in docs/file.tbl.
A listing of the base data filenames can be found in docs/file.ids.

The data directory structure is as follows:

    ./docs
    ./data/source_ascii-engtext -- the English text extracted from the SU files, with any higher-level ASCII characters or non-ASCII characters changed into common ASCII characters (details in docs/edit.list)
    ./data/ag_xml -- the annotation files in AG format, including all POS and treebank annotation as well as any comments from the annotators
    ./data/penntree -- the annotation files in Penn Treebank bracketed list style
    ./data
    ./dtds

6. File Format Description

6.1 *.flat (in data/source_ascii-engtext)

SU level tokens are separated by line breaks, and introduced by delimiters. All brackets except for "(" and ")" are converted out of -LCB-/etc. form, and all html character codes have been converted to their corresponding characters.

Bracket representation is as follows:

    () are represented as -LRB- and -RRB- in the .tree files (this includes emoticons like :--RRB- )
    [] {} <> and all other brackets are unchanged.

Any higher-level ASCII characters or non-ASCII characters from the SU files are changed to common ASCII characters (details in docs/edit.list).

6.2 *.xml (in data/ag_xml/)

TreeEditor .xml stand-off annotation files. These files contain only the POS and Treebank annotation and reference the source files in data/source_ascii-engtext by character offset.

6.3 *.tree (in data/penntree/)

Bracketed tree files following the basic form (NODE (TAG token)). Each SU is surrounded by a pair of empty parentheses.

Sample:

    ( (S (NP-SBJ (PRP I))
         (VP (MD 'll)
             (VP (VB post)
                 (NP (NP (NNS highlights))
                     (PP (IN from)
                         (NP (DT the) (NN opinion) (CC and) (NNS dissents))))
                 (SBAR-TMP (WHADVP-9 (WRB when))
                           (S (NP-SBJ (PRP I))
                              (VP (VBP 'm)
                                  (ADJP-PRD (JJ finished))
                                  (ADVP-TMP-9 (-NONE- *T*)))))))
         (. .)) )

Usually this means that a pair of empty parentheses surrounds each sentence, but for SUs containing more than one sentence, the pair of empty parentheses will contain all necessary top level sentences.

7. Data Validation

    automatic tokenization
    => human correction of tokenization
    => automatic pre-tag
    => human correction and annotation of Part-of-Speech
    => automatic pre-parse
    => human correction and annotation of syntactic structure
    => QC correction
    => additional QC correction

The first QC process consists of a series of specific searches for approximately 200 types of potential inconsistency and parser or annotation error. Any errors found in these searches were hand corrected.

We have added an additional QC process that identifies repeated text and structures, and flags non-matching annotations. Annotation errors found in this way have been manually corrected. See Section 2.2 above for references to papers that describe this process.

8. DTDs

The DTD files for the AG are kept in data/ag_xml, as well as in the dtds/ directory.

9. Copyright Information

Portions (c) 2012-2019 Trustees of the University of Pennsylvania

10. Contact Information

Contact info for key project personnel:

    Ann Bies, Senior Research Coordinator, Linguistic Data Consortium, bies@ldc.upenn.edu

11. Update Log

This index was updated on January 24, 2019 by Seth Kulick
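
Appendix: reading the *.tree files (illustrative sketch)

The bracketed format described in Section 6.3 can be read with a small stack-based parser. The Python sketch below is not part of the release and is not LDC tooling; the names parse_trees, leaves and surface_words are hypothetical helpers introduced here for illustration. It assumes that every literal parenthesis in the token text is escaped as -LRB- or -RRB- (as stated in Section 6.1) and that empty categories carry the -NONE- tag, as in the sample tree.

```python
import re

# A token is an opening paren, a closing paren, or a run of other non-space characters.
TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")

# The sample tree from Section 6.3, on one line.
SAMPLE = (
    "( (S (NP-SBJ (PRP I)) (VP (MD 'll) (VP (VB post) (NP (NP (NNS highlights)) "
    "(PP (IN from) (NP (DT the) (NN opinion) (CC and) (NNS dissents)))) "
    "(SBAR-TMP (WHADVP-9 (WRB when)) (S (NP-SBJ (PRP I)) (VP (VBP 'm) "
    "(ADJP-PRD (JJ finished)) (ADVP-TMP-9 (-NONE- *T*))))))) (. .)) )"
)


def parse_trees(text):
    """Yield one list per SU; each list holds that SU's top-level trees.

    A tree is a nested list: [label, child, ...]; a preterminal is [tag, token].
    """
    stack = []
    for tok in TOKEN_RE.findall(text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            node = stack.pop()
            if stack:
                stack[-1].append(node)
            else:
                # Closing the unlabeled outer pair ends one SU.
                yield node
        elif stack:
            stack[-1].append(tok)


def leaves(node):
    """Yield (tag, token) pairs for every preterminal below node, in order."""
    if len(node) == 2 and isinstance(node[0], str) and isinstance(node[1], str):
        yield node[0], node[1]
        return
    for child in node:
        if isinstance(child, list):
            yield from leaves(child)


def surface_words(su):
    """Return the tokens of one SU, skipping -NONE- empty categories
    and restoring literal parentheses from -LRB- / -RRB-."""
    return [
        tok.replace("-LRB-", "(").replace("-RRB-", ")")
        for tag, tok in leaves(su)
        if tag != "-NONE-"
    ]


for su in parse_trees(SAMPLE):
    print(len(su), "top-level tree(s):", " ".join(surface_words(su)))
```

Because the unlabeled outer parentheses are kept as the unit that is yielded, an SU containing several sentences comes back as a list with more than one top-level tree, which mirrors the convention described in Section 6.3. To read a file from data/penntree, pass its full text to parse_trees.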