BOLT English Translation Treebank - Chinese SMS/Chat

CatalogID: LDC2021T19
Release date: December 15th, 2021
Linguistic Data Consortium
Authors: Ann Bies, Justin Mott, Colin Warner, Seth Kulick

1. Introduction

This release of English Treebank consists of 108,385 tokens/words (106,783 tokens after translation alternates are removed) in 194 files of SMS/Chat text from various sources translated from Chinese to English and annotated for part-of-speech and syntactic structure.

This data was previously released as subcorpora in earlier versions to the BOLT community; this publication consolidates the Phase 2 SMS/Chat ECTB data. The corpora that were released to the BOLT community previously had the catalog numbers LDC2014E44 (SMS/Chat Part 3 ECTB) and LDC2014E78 (SMS/Chat Part 4 ECTB).

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

2. Annotation

2.1 Tasks and Guidelines

The guidelines followed for both part-of-speech and treebank annotation are essentially Penn Treebank II style, with the changes to those guidelines that were developed under the GALE (Global Autonomous Language Exploitation) program, primarily in the tokenization of hyphenated words, part-of-speech and tree changes necessitated by those tokenization changes, and updates to the syntactic annotation to comply with the most recent annotation guidelines (including the "Treebank-PropBank merge" or "Treebank IIa" and "treebank c" changes).

The original Penn Treebank II guidelines can be found at
https://repository.upenn.edu/cis_reports/1021/

Addenda detailing the more recent changes can be found at docs/EnglishTreebankSupplementalGuidelines.pdf and docs/WebtextTBAnnotationGuidelines.pdf. Additional guidelines developed to account for novel constructions, etc. in the SMS/Chat genre can be found in docs/TreebankGuidelines-ChatSMS-v1.3.pdf.

All co-reference indices are shown on the syntactic node label, including reference indices on the node labels for the empty categories (as in all LDC English and Arabic Treebank releases).

For the English translated from Chinese data in this release, the translation data text may also include both literal and fluent English translation alternates for certain idiomatic Chinese phrases.

We are providing in this release a version of the annotated trees that has had the literal translation alternates and the associated metadata removed from the trees. These trees can be found in the following directory in this release:

/data/translation-alternates-removed/penntree/

The usual file types for an LDC English Treebank release with the translation alternates included in the source and the trees can be found in this directory in this release:

/data/translation-alternates-included/

We developed the following guidelines for annotating the translation alternates:

1. Both literal and fluent translation alternates are annotated for word-level tokenization and part-of-speech. Metadata punctuation tokens delimiting the alternate translations also receive the usual POS tags for these punctuation marks: (-LRB- [), (-RRB- ]), and (SYM |).

2. Only the fluent translation alternates are annotated as part of the syntactic structure of the tree. The syntactic node "META" is used for the open square bracket metadata preceding the fluent translation alternate, and also for the full extent of the literal translation alternate, including the pipe preceding it and the close square bracket following it. Syntactic structure is not annotated inside the META node.
For example:

/data/translation-alternates-included/penntree/CHT_CMN_20120319.0003.eng.xml.tree

( (INTJ (INTJ (UH Okay)) (, ,) (INTJ (META (-LRB- [)) (UH bye) (UH bye) (META (SYM |) (CD 88) (-RRB- ])))) )

Note that in a small number of cases there was some variation in the markup that delimits the translation alternates. For example, "l" appears in place of the expected "|", "}" for "]", the initial "[" bracket may be missing, and the final "]" bracket may be missing. For Treebank purposes, we have marked the actual translation alternates and the existing markup with META nodes, regardless of such variation.

3. The trees with the literal translation alternates removed in /data/translation-alternates-removed/penntree/ are obtained by removing all META nodes and their children.

For example:

/data/translation-alternates-removed/penntree/CHT_CMN_20120319.0003.eng.xml.tree

( (INTJ (INTJ (UH Okay)) (, ,) (INTJ (UH bye) (UH bye))))

Both versions of the trees are available in this release (including both literal and fluent translation alternates as in (2), and with the literal alternates removed as in (3)).

A paper detailing the motivation and annotation of the translation alternates in the treebank can be found in docs/BiesEtal-LREC2014.pdf:

Ann Bies, Justin Mott, Seth Kulick, Jennifer Garland, and Colin Warner. 2014. Incorporating Alternate Translations into English Translation Treebank. In Eds. Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), May 26-31, 2014, Reykjavik, Iceland. European Language Resources Association (ELRA), isbn 978-2-9517408-8-4.
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2014-alternate-translations-english-translation-treebank.pdf

2.2 Annotation Process

The source files in data/translation-alternates-included/source_ascii-engtext consist of only the English text that is being treebanked. This text was extracted from files of the same name that will be released as part of the BOLT Chinese-English Word Alignment Training Data SMS/Chat.

Word-level tokenization by script was manually corrected to be consistent with tokenization guidelines developed for English Treebanks in the GALE project. SU-level tokenization, however, was inherited from the SU annotation and translation, and was not adjusted during the treebank process (with the exception of multiple sentences contained in a single SU, which were necessarily treebanked as separate sentences).

Higher level ASCII characters or non-ASCII characters in the text were reduced to comparable common ASCII characters, as detailed in docs/edit.list. A listing of all of the changed characters in this release is in docs/changed_chars.list.

The tokenized data was run through an automatic POS tagger, and the tagger output was manually corrected to be consistent with the English Treebank part-of-speech annotation guidelines in https://repository.upenn.edu/cis_reports/570/ and addenda listed in Section 2.1.

The corrected POS annotated data was adjusted manually and by script to insert the META nodes for the alternate translations and to insert pre-bracketed Ss for the SUs that contain multiple sentences.

This data was run through an automatic parser. For this parser run, the pre-bracketed Ss were respected, and the META nodes were not automatically parsed. The parser output was manually corrected to be consistent with the English Treebank syntactic annotation guidelines in https://repository.upenn.edu/cis_reports/1021/ and addenda listed in Section 2.1.
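The META-removal rule in item 3 of Section 2.1 (remove all META nodes and their children) can be illustrated with a short Python sketch. This is not the script used to build this release; the function names (tokenize, read_node, strip_meta, to_string) are hypothetical, and the sketch assumes that META nodes carry the bare label "META" and that a parent never becomes empty once its META children are dropped.

```python
import re

def tokenize(text):
    # Parentheses are structural; literal brackets in the text appear as
    # -LRB-/-RRB-, so every other token is a run of non-space, non-paren chars.
    return re.findall(r"\(|\)|[^\s()]+", text)

def read_node(tokens, pos=0):
    """Parse one bracketed node; return ([label, child, ...], next_pos)."""
    assert tokens[pos] == "("
    pos += 1
    label = ""  # the outermost wrapper around each SU is unlabeled
    if tokens[pos] not in ("(", ")"):
        label = tokens[pos]
        pos += 1
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = read_node(tokens, pos)
        else:
            child = tokens[pos]  # a word under a POS tag
            pos += 1
        children.append(child)
    return [label] + children, pos + 1

def strip_meta(node):
    """Drop every META node together with everything below it."""
    kept = [strip_meta(c) if isinstance(c, list) else c
            for c in node[1:]
            if not (isinstance(c, list) and c[0] == "META")]
    return [node[0]] + kept

def to_string(node):
    parts = [to_string(c) if isinstance(c, list) else c for c in node[1:]]
    return "(" + " ".join([node[0]] + parts) + ")"

# The example tree from Section 2.1, with translation alternates included.
with_alternates = (
    "( (INTJ (INTJ (UH Okay)) (, ,) (INTJ (META (-LRB- [)) (UH bye) "
    "(UH bye) (META (SYM |) (CD 88) (-RRB- ])))) )"
)
tree, _ = read_node(tokenize(with_alternates))
result = to_string(strip_meta(tree))
print(result)
```

Run on the example above, the sketch reproduces the corresponding tree shown for /data/translation-alternates-removed/penntree/. Files with the markup variations noted in Section 2.1 are handled the same way, since the rule depends only on the META label and not on the delimiter characters.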
The first QC process consisted of a series of specific searches for approximately 200 types of potential inconsistency and parser or annotation error. Any errors found in these searches were hand corrected.

We also included an additional QC process that identified repeated text and structures, and flagged non-matching annotations. Annotation errors found in this way have been manually corrected. The following papers describe this process:

Seth Kulick, Ann Bies, and Justin Mott
Using Derivation Trees for Treebank Error Detection
ACL 2011, Portland, Oregon, USA, June 19-24, 2011
http://papers.ldc.upenn.edu/ACL2011/DerivationTrees_TBErrorDetection.pdf

Seth Kulick and Ann Bies
A TAG-derived Database for Treebank Search and Parser Analysis
TAG+10: 10th International Workshop on Tree Adjoining Grammars and Related Formalisms, New Haven, CT, June 10-12, 2010
http://papers.ldc.upenn.edu/TAG2010/tag-paper-correct.pdf

Lead annotators for this project were Justin Mott and Colin Warner. Additional annotators were Jonathan Gress-Wright and Arrick Lanfranchi.

3. Source Data Profile

3.1 Data Selection Process

Data was selected for this ECTB treebank annotation in order to maximize multiple annotations. The Chinese source data for these files has been SU-annotated, treebanked, and translated into English.

3.2 Data Sources and Epochs

The data consists of English translations of Chinese source data SMS/Chat text from various sources collected by LDC. The Chinese source data has been released as BOLT Chinese SMS/Chat (LDC2018T15). This English text was extracted from files of the same name that will be released as part of the BOLT Chinese-English Word Alignment Training Data SMS/Chat.

4. Annotated Data Profile

This data consists of 108,385 tokens/words (106,783 tokens after translation alternates are removed) in 194 files of SMS/Chat text from various sources translated from Chinese to English, all of which have been annotated for word-level tokenization, part-of-speech, and syntactic structure.

The source files in data/translation-alternates-included/source_ascii-engtext consist of only the English text that is being treebanked. All annotation content is contained within metadata tags. Any non-ASCII characters in the original SU version of these files have been replaced with low level ASCII characters according to docs/edit.list. A listing of all of the changed characters in this release is in docs/changed_chars.list.

5. Directory Structure

A listing of all of the files in this release can be found in docs/file.tbl. A listing of the base data filenames can be found in docs/file.ids.

The directory structure is as follows:

./docs -- documentation files for this release.

./data/translation-alternates-included/source_ascii-engtext -- the English text extracted from the translation files. These files include both literal and fluent translation alternates.

./data/translation-alternates-included/ag_xml -- the annotation files in AG format, including all POS and treebank annotation as well as any comments from the annotators. These files include both literal and fluent translation alternates.

./data/translation-alternates-included/penntree -- the annotation files in Penn Treebank bracketed list style. These files include both literal and fluent translation alternates, annotated as discussed above.

./data/translation-alternates-removed/penntree -- the annotation files in Penn Treebank bracketed list style. These files do NOT include the literal translation alternates. These trees include only the fluent translation alternates. The META nodes containing the literal translation alternates have been removed from these trees, as discussed above.

6. File Format Description

6.1 *.flat (in data/translation-alternates-included/source_ascii-engtext)

SU level tokens are separated by line breaks, and introduced by delimiters. Only round brackets, "(" and ")", are converted to -LRB-/-RRB- form, and any html character codes have been converted to their corresponding characters.

Bracket representation is as follows:

  () are represented as -LRB- and -RRB- in the .tree files
     (this includes emoticons like :--RRB- )
  [] {} <> and all other brackets are unchanged.

6.2 *.xml (in data/translation-alternates-included/ag_xml/)

TreeEditor .xml stand-off annotation files. These files contain only the POS and Treebank annotations and reference the source files in data/translation-alternates-included/source_ascii-engtext by character offset.

6.3 *.tree (in data/translation-alternates-included/penntree/ and in data/translation-alternates-removed/penntree)

Bracketed tree files following the basic form (NODE (TAG token)). Each SU is surrounded by a pair of unlabeled parentheses.

Sample:

( (S (NP-SBJ (PRP I))
     (VP (MD 'll)
         (VP (VB post)
             (NP (NP (NNS highlights))
                 (PP (IN from)
                     (NP (DT the) (NN opinion) (CC and) (NNS dissents))))
             (SBAR-TMP (WHADVP-9 (WRB when))
                       (S (NP-SBJ (PRP I))
                          (VP (VBP 'm)
                              (ADJP-PRD (JJ finished))
                              (ADVP-TMP-9 (-NONE- *T*)))))))
     (. .)) )

Usually this means that a pair of unlabeled parentheses surrounds each sentence, but for SUs containing more than one sentence, the pair of unlabeled parentheses will contain all necessary top level sentences.

7. Data Validation

The data went through the following steps:

   automatic tokenization
=> human correction of tokenization
=> automatic pre-tag
=> human correction and annotation of Part-of-Speech
=> insertion of META nodes and S nodes for SUs with multiple sentences
=> automatic pre-parse
=> human correction and annotation of syntactic structure
=> QC correction, search-based
=> additional QC correction, KBM-based

The first QC process consisted of a series of specific searches for approximately 200 types of potential inconsistency and parser or annotation error. Any errors found in these searches were hand corrected.

We have added an additional QC process that identified repeated text and structures, and flagged non-matching annotations. Annotation errors found in this way have been manually corrected. See Section 2.2 above for references to papers that describe this process.

8. DTDs

The DTD files for the AG are kept in data/translation-alternates-included/ag_xml, as well as in the dtds/ directory.

9. Copyright Information

Portions (c) 2012-2021 Trustees of the University of Pennsylvania

10. Contact Information

Contact info for key project personnel:

Ann Bies, Senior Research Coordinator, Linguistic Data Consortium, bies@ldc.upenn.edu

11. Update Log

This index was updated on April 10, 2019 by Seth Kulick
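
As an illustration of the *.tree format described in Section 6.3, the following Python sketch splits the text of a .tree file into SU-level units (one per pair of unlabeled outer parentheses) and lists the (TAG, token) leaves of each unit. It is a minimal sketch and not part of the release tooling; the names split_sus and leaves are hypothetical, and skipping -NONE- leaves by default is an assumption about what a reader may want, not a statement of how the token counts in this release were computed.

```python
import re

# A leaf has the form (TAG token); neither part contains spaces or parentheses,
# because literal round brackets in the text are written as -LRB-/-RRB-.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")

def split_sus(text):
    """Yield each top-level parenthesized unit (one SU) found in the text."""
    depth, start = 0, None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                yield text[start:i + 1]

def leaves(tree, keep_empty=False):
    """Return (TAG, token) pairs; empty categories (-NONE-) are skipped by default."""
    pairs = LEAF.findall(tree)
    if not keep_empty:
        pairs = [(tag, tok) for tag, tok in pairs if tag != "-NONE-"]
    return pairs

# The sample tree from Section 6.3, written on one line.
sample = (
    "( (S (NP-SBJ (PRP I)) (VP (MD 'll) (VP (VB post) (NP (NP (NNS highlights)) "
    "(PP (IN from) (NP (DT the) (NN opinion) (CC and) (NNS dissents)))) "
    "(SBAR-TMP (WHADVP-9 (WRB when)) (S (NP-SBJ (PRP I)) (VP (VBP 'm) "
    "(ADJP-PRD (JJ finished)) (ADVP-TMP-9 (-NONE- *T*))))))) (. .)) )"
)

for su in split_sus(sample):
    print(" ".join(tok for _, tok in leaves(su)))
```

Because an SU that contains several sentences still sits inside a single pair of unlabeled parentheses, split_sus returns it as one unit; the individual sentences are its top-level children.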