BOLT English Treebank - Discussion Forum ECTB 
CatalogID: LDC2020T09
Release date: May 15, 2020
Linguistic Data Consortium
Authors: Ann Bies, Justin Mott, Colin Warner, Seth Kulick

1. Introduction

This release of English Treebank consists of 147,432 tokens/words
(145,221 tokens after translation alternates are removed) in 148 files
of discussion forum text from various sources translated from Chinese
to English and annotated for part-of-speech and syntactic structure.

This data was previously released as subcorpora in earlier versions to the
BOLT community; this publication consolidates the Phase 1 English Treebank
Discussion Forum ECTB data. The corpora that were released to the BOLT
community previously had the catalog numbers LDC2013E50
(DF Part 6 V1.1 ECTB) and LDC2013E76 (DF Part 7 v 1.0 ECTB).

This material is based upon work supported by the Defense Advanced
Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145.
The content does not necessarily reflect the position or the policy of
the Government, and no official endorsement should be inferred.

2. Annotation

2.1 Tasks and Guidelines

The guidelines followed for both part-of-speech and treebank
annotation are essentially Penn Treebank II style, with the changes to
those guidelines that were developed under the GALE (Global Autonomous
Language Exploitation) program, primarily in the tokenization of
hyphenated words, part-of-speech and tree changes necessitated
by those tokenization changes, and updates to the syntactic annotation
to comply with the most updated annotation guidelines
(including the "Treebank-PropBank merge" or "Treebank IIa"
and "treebank c" changes).  The original Penn Treebank II guidelines can
be found at https://repository.upenn.edu/cis_reports/1021/
Addenda detailing the more recent changes can be found at
docs/etb-supplementary-guidelines-2009-addendum.pdf and
docs/etb-webtext-guidelines.pdf.

All co-reference indices are shown on the syntactic node label,
including reference indices on the node labels for the empty
categories (as in all LDC English and Arabic Treebank releases).

For the English data in this release, which was translated from
Chinese, the translated text may also include both literal and fluent
English translation alternates for certain idiomatic Chinese phrases.

We are providing in this release a version of the annotated trees that has
had the literal translation alternates and the associated metadata removed
from the trees.  These trees can be found in the following directory in this
release:
/data/translation-alternates-removed/penntree/

The usual file types for an LDC English Treebank release with the
translation alternates included in the source and the trees can be
found in this directory in this release:
/data/translation-alternates-included/

We developed the following guidelines for annotating the
translation alternates:

1. Both literal and fluent translation alternates are annotated for
word-level tokenization and part-of-speech.  Metadata punctuation
tokens delimiting the alternate translations also receive the usual
POS tags for these punctuation marks: 
(-LRB- [), (-RRB- ]), and (SYM |).

2. Only the fluent translation alternates are annotated as part of the
syntactic structure of the tree.  

The syntactic node "META" is used for the open square bracket metadata
preceding the fluent translation alternate, and also for the full
extent of the literal translation alternate, including the pipe
preceding it and the close square bracket following it.  Syntactic
structure is not annotated inside the META node.

For example:

/data/translation-alternates-included/penntree/bolt-cmn-DF-196-185661-3108009.eng.xml.tree:
( (PP (IN Unlike) (NP (NP (DT those) (NNS pigs) (, ,) (NNS sheep) (, ,) (CONJP (RB as) (RB well) (IN as)) (NNS dogs)) (, ,) (SBAR (WHNP-9 (WDT which)) (S (NP-SBJ-9 (-NONE- *T*)) (VP (VBP are) (VP (META (-LRB- [)) (VBN killed) (NP-9 (-NONE- *)) (ADVP-MNR (RB shortly) (META (SYM |) (JJ white) (NN knife) (RB in) (CC and) (JJ red) (NN knife) (RB out) (-RRB- ])) (, ,) (RB simply) (CC and) (RB directly))))))) (. .)) )

Note that in a small number of cases the markup delimiting the
translation alternates varies from the expected form.  For example,
"l" appears in place of the expected "|", "}" for "]", the initial "["
bracket may be missing, and the final "]" bracket may be missing.  For
Treebank purposes, we have marked the actual translation alternates
and the existing markup with META nodes, regardless of such variation.

3. The trees with the literal translation alternates removed in
/data/translation-alternates-removed/penntree/ are obtained by
removing all META nodes and their children.

For example:

/data/translation-alternates-removed/penntree/bolt-cmn-DF-196-185661-3108009.eng.xml.tree:
( (PP (IN Unlike)  (NP (NP (DT those)  (NNS pigs)  (, ,)  (NNS sheep)  (, ,)  (CONJP (RB as) (RB well) (IN as))  (NNS dogs))  (, ,)  (SBAR (WHNP-1 (WDT which))  (S (NP-SBJ-1 (-NONE- *T*))  (VP (VBP are)  (VP (VBN killed)  (NP-1 (-NONE- *))  (ADVP-MNR (RB shortly) (, ,) (RB simply) (CC and) (RB directly)))))))  (. .)) )
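
The removal described in (3) can be approximated with a short script.
The following Python sketch is not the tool used to build this release.
It assumes that the nodes to drop are always labeled exactly META and
that every parenthesis in a .tree file is structural (token parentheses
appear as -LRB-/-RRB-).  The function name strip_meta is hypothetical.
The sketch does not renumber co-reference indices, so its output can
differ from the released trees in index values (compare the indices 9
and 1 in the two examples above) and in spacing.

```python
def strip_meta(tree: str) -> str:
    """Return the bracketed string `tree` with every (META ...) subtree removed."""
    out = []
    i, n = 0, len(tree)
    while i < n:
        if tree.startswith("(META ", i):
            # Drop the whitespace that separated META from its left sibling.
            while out and out[-1].isspace():
                out.pop()
            # Skip ahead to the parenthesis that closes this META node.
            depth = 0
            while i < n:
                if tree[i] == "(":
                    depth += 1
                elif tree[i] == ")":
                    depth -= 1
                    if depth == 0:
                        i += 1
                        break
                i += 1
        else:
            out.append(tree[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    fragment = (
        "(VP (VBP are) (VP (META (-LRB- [)) (VBN killed) "
        "(ADVP-MNR (RB shortly) (META (SYM |) (JJ white) (NN knife) (-RRB- ])) "
        "(, ,) (RB simply))))"
    )
    print(strip_meta(fragment))
```

Scanning by parenthesis depth avoids building a full tree, which is
enough here because no structure is annotated inside a META node.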

Both versions of the trees are available in this release: one including
both literal and fluent translation alternates as in (2), and one with
the literal alternates removed as in (3).  In addition, the metadata span
information and the token spans for the Chinese characters are
contained in the word alignment release (see Section 3.2).

2.2 Annotation Process

The English text to be treebanked was extracted by script from files
of the same name that were released as part of
BOLT Chinese Discussion Forum Parallel Training Data (LDC2017T05).

Word-level tokenization by script was manually corrected to be
consistent with tokenization guidelines developed for English
Treebanks in the GALE project. SU-level tokenization, however,
was inherited from the SU annotation and translation, and was not
adjusted during the treebank process (with the exception of multiple
sentences contained in a single SU, which were necessarily treebanked
as separate sentences).

The tokenized data was run through an automatic POS tagger, and the
tagger output was manually corrected to be consistent with the English
Treebank part-of-speech annotation guidelines in
https://repository.upenn.edu/cis_reports/570/
and addenda listed in Section 2.1.

The corrected POS-annotated data was adjusted manually and by script
to insert the META nodes for the alternate translations and to insert
pre-bracketed Ss for the SUs that contain multiple sentences.  This
data was run through an automatic parser.  For this parser run, the
pre-bracketed Ss were respected, and the META nodes were not
automatically parsed.

The parser output was manually corrected to be consistent with the
English Treebank syntactic annotation guidelines in
https://repository.upenn.edu/cis_reports/1021/
and addenda listed in Section 2.1.

The first QC process consists of a series of specific searches for
approximately 200 types of potential inconsistency and parser or
annotation error.  Any errors found in these searches were hand
corrected.

We also included an additional QC process that identifies repeated text
and structures, and flags non-matching annotations.  Annotation errors
found in this way have been manually corrected.  This error detection
system was a work in progress and was still being refined at the
time of this annotation.

The following papers describe this process:

Seth Kulick, Ann Bies, and Justin Mott
Using Derivation Trees for Treebank Error Detection
ACL 2011, Portland, Oregon, USA, June 19-24, 2011
https://papers.ldc.upenn.edu/ACL2011/DerivationTrees_TBErrorDetection.pdf

Seth Kulick and Ann Bies
A TAG-derived Database for Treebank Search and Parser Analysis
TAG+10: 10th International Workshop on Tree Adjoining Grammars and Related Formalisms, New Haven, CT, June 10-12, 2010
https://papers.ldc.upenn.edu/TAG2010/tag-paper-correct.pdf

Lead annotators for this project were Justin Mott and Colin Warner.
Additional annotators were John Laury, Myke Eggers, Jonathan
Gress-Wright, and Arrick Lanfranchi.

3. Source Data Profile

3.1 Data Selection Process

Data was selected for this ECTB treebank annotation in order to
maximize the amount of data that carries multiple layers of
annotation.  The Chinese source data for these files has been
SU-annotated, treebanked, and translated into English.

3.2 Data Sources and Epochs

The data consists of English translations of Chinese discussion forum
text from various sources collected by LDC in 2011 and 2012.  The
Chinese source data has been released as BOLT Chinese Discussion
Forums (LDC2016T05), the Chinese-English word alignment data has been
released as BOLT Chinese-English Word Alignment and Tagging
(LDC2016T19), and the Chinese Treebank of the Chinese data has been
released as part of Chinese Treebank 9.0 (LDC2016T13).

4. Annotated Data Profile

This data consists of 147,432 tokens/words (145,221 tokens after
translation alternates are removed) in 148 files of discussion forum
text from various sources translated from Chinese to English, all of
which have been annotated for word-level tokenization, part-of-speech,
and syntactic structure.

The source files in
data/translation-alternates-included/source_ascii-engtext consist of
only the English text that is being treebanked.  This text was
extracted from files of the same name from the translation releases.
All annotation content is contained within <en=#> metadata tags.

5. Directory Structure

A listing of all of the files in this release can be found in
docs/file.tbl.  A listing of the base data filenames can be found in
docs/file.ids.

The directory structure is as follows:

./docs -- documentation files for this release.
./data/translation-alternates-included/source_ascii-engtext -- the 
     English text extracted from the translation files.  These files 
     include both literal and fluent translation alternates.
./data/translation-alternates-included/ag_xml -- the annotation files 
     in AG format, including all POS and treebank annotation as well 
     as any comments from the annotators.  These files include both 
     literal and fluent translation alternates.
./data/translation-alternates-included/penntree -- the annotation 
     files in Penn Treebank bracketed list style.  These files include 
     both literal and fluent translation alternates, annotated as 
     discussed above.
./data/translation-alternates-removed/penntree -- the annotation
     files in Penn Treebank bracketed list style.  These files do 
     NOT include the literal translation alternates.  These trees 
     include only the fluent translation alternates.  The META nodes 
     containing the literal translation alternates have been removed 
     from these trees, as discussed above.
./dtds -- DTDs for the XML files.

6. File Format Description

6.1 *.txt (in data/translation-alternates-included/source_ascii-engtext)

SU level tokens are separated by line breaks, and introduced by <en=#>
delimiters.
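
As an illustration only, the Python sketch below splits such a file
into SUs.  It assumes that each SU occupies one line, that the line
begins with a tag of the form <en=N> where N is a decimal integer, and
that the SU text follows the tag on the same line.  These details are
assumptions and should be checked against the actual files; the
function name split_sus is hypothetical.

```python
import re

# Assumed tag shape: "<en=" + decimal integer + ">" at the start of a line.
EN_TAG = re.compile(r"<en=(\d+)>")


def split_sus(text: str) -> list[tuple[int, str]]:
    """Return (SU number, SU text) pairs for every line that starts with an <en=N> tag."""
    sus = []
    for line in text.splitlines():
        line = line.strip()
        match = EN_TAG.match(line)
        if match:
            sus.append((int(match.group(1)), line[match.end():].strip()))
    return sus


if __name__ == "__main__":
    sample = "<en=1> Hello there .\n<en=2> A second unit .\n"
    for number, su in split_sus(sample):
        print(number, su)
```

Lines that do not start with a tag are skipped in this sketch.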

Only round brackets, "(" and ")", are converted to -LRB-/-RRB- form,
and any HTML character codes have been converted to their
corresponding characters.

Bracket representation is as follows:

() are represented as -LRB- and -RRB- in the .tree files (this
includes emoticons like :--RRB- )

[] {} <> and all other brackets are unchanged.
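
To recover the original surface form of a token read from a .tree
file, the substitution can be reversed.  The Python sketch below shows
one way to do this; the function name restore_parens is hypothetical.
It should be applied to tokens only, not to part-of-speech tags, since
-LRB- and -RRB- are also used as the tags for square brackets in the
translation alternate markup.

```python
def restore_parens(token: str) -> str:
    """Map -LRB-/-RRB- inside a token back to round brackets."""
    return token.replace("-LRB-", "(").replace("-RRB-", ")")


if __name__ == "__main__":
    for token in ["-LRB-", "-RRB-", ":--RRB-", "["]:
        print(token, "=>", restore_parens(token))
```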

6.2 *.xml (in data/translation-alternates-included/ag_xml/)

TreeEditor .xml stand-off annotation files.  These files contain only
the POS and Treebank annotations and reference the source files in
data/translation-alternates-included/source_ascii-engtext by character
offset.

6.3 *.tree (in data/translation-alternates-included/penntree/ and in
data/translation-alternates-removed/penntree)

Bracketed tree files following the basic form (NODE (TAG token)).  Each
SU is surrounded by a pair of unlabeled parentheses.  Sample:

( (S (NP-SBJ (PRP I)) (VP (MD 'll) (VP (VB post) (NP (NP (NNS highlights)) (PP (IN from) (NP (DT the) (NN opinion) (CC and) (NNS dissents)))) (SBAR-TMP (WHADVP-9 (WRB when)) (S (NP-SBJ (PRP I)) (VP (VBP 'm) (ADJP-PRD (JJ finished)) (ADVP-TMP-9 (-NONE- *T*))))))) (. .)) )

Usually this means that a pair of unlabeled parentheses surrounds each
sentence, but for SUs containing more than one sentence, the pair of
unlabeled parentheses will contain all necessary top level sentences.
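
The bracketed format can be read with a small stack-based parser.  The
Python sketch below is one possible reader, not part of this release.
It assumes that tokens never contain raw parentheses (they appear as
-LRB-/-RRB-), and the function names parse_trees and leaves are
hypothetical.  Each top-level unlabeled pair of parentheses becomes one
nested list, so an SU with several sentences yields a list with several
children.

```python
import re


def parse_trees(text: str) -> list:
    """Parse bracketed text into nested lists, one per top-level (SU) bracket."""
    stack, trees = [], []
    for tok in re.findall(r"\(|\)|[^\s()]+", text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            node = stack.pop()
            if stack:
                stack[-1].append(node)
            else:
                trees.append(node)
        else:
            stack[-1].append(tok)
    return trees


def leaves(node: list):
    """Yield (tag, token) pairs for the preterminals under `node`, left to right."""
    if len(node) == 2 and all(isinstance(item, str) for item in node):
        yield node[0], node[1]
    else:
        for child in node:
            if isinstance(child, list):
                yield from leaves(child)


if __name__ == "__main__":
    sample = ("( (S (NP-SBJ (PRP I)) (VP (VBP 'm) (ADJP-PRD (JJ finished)) "
              "(ADVP-TMP-9 (-NONE- *T*))) (. .)) )")
    for su in parse_trees(sample):
        # Empty categories carry the tag -NONE- and are not words of the text.
        print([tok for tag, tok in leaves(su) if tag != "-NONE-"])
```

Filtering on the -NONE- tag drops traces and other empty categories,
which is usually wanted when recovering the token sequence of an SU.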

7. Data Validation

The annotation and validation steps were carried out in this order:

1. automatic tokenization
2. human correction of tokenization
3. automatic pre-tag
4. human correction and annotation of Part-of-Speech
5. insertion of META nodes and S nodes for SUs with multiple sentences
6. automatic pre-parse
7. human correction and annotation of syntactic structure
8. QC correction, search-based
9. additional QC correction, KBM-based

The first QC process consists of a series of specific searches for
approximately 200 types of potential inconsistency and parser or
annotation error.  Any errors found in these searches were hand
corrected.

We have added an additional QC process that identifies repeated text
and structures, and flags non-matching annotations.  Annotation errors
found in this way have been manually corrected.  See Section 2.2 above
for references to papers that describe this process.

8. DTDs

The DTD files for the AG are kept in
data/translation-alternates-included/ag_xml, as well as in the dtds/
directory.

9. Copyright Information

Portions (c) 2012-2019 Trustees of the University of Pennsylvania

10. Contact Information

Contact info for key project personnel:
Ann Bies, Senior Research Coordinator, Linguistic Data Consortium,
bies@ldc.upenn.edu

11. Update Log

This index was updated on January 24, 2019 by Seth Kulick