BOLT English Treebank - SMS/Chat
CatalogID: LDC2021T03
Release date: January 15, 2021
Linguistic Data Consortium
Authors: Ann Bies, Justin Mott, Colin Warner, Seth Kulick

1. Introduction

This release of English Treebank consists of 115,667 tokens/words in
484 files of English source SMS/Chat text from various sources
annotated for part-of-speech and syntactic structure.

This data was previously released to the BOLT community as subcorpora;
this publication consolidates the Phase 2 English Treebank SMS/Chat
data. The corpora previously released to the BOLT community had the
catalog numbers LDC2013E127 (SMS/Chat Part 1) and
LDC2014E03 (SMS/Chat Part 2).

This material is based upon work supported by the Defense Advanced
Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145.
The content does not necessarily reflect the position or the policy of
the Government, and no official endorsement should be inferred.

2. Annotation

2.1 Tasks and Guidelines

The guidelines followed for both part-of-speech and treebank
annotation are essentially Penn Treebank II style, with the changes to
those guidelines that were developed under the GALE (Global Autonomous
Language Exploitation) program, primarily in the tokenization of
hyphenated words, part-of-speech and tree changes necessitated
by those tokenization changes, and updates to the syntactic annotation
to comply with the most recent annotation guidelines
(including the "Treebank-PropBank merge" or "Treebank IIa"
and "treebank c" changes).  The original Penn Treebank II guidelines can
be found at https://repository.upenn.edu/cis_reports/1021/
Addenda detailing the more recent changes can be found at
docs/EnglishTreebankSupplementalGuidelines.pdf.
Additional guidelines developed specifically to account for novel
constructions, etc. in the webtext genre can be found in
docs/WebtextTBAnnotationGuidelines.pdf.  Additional guidelines
developed to account for novel constructions, etc. in the SMS/Chat
genre can be found in docs/TreebankGuidelines-ChatSMS-v1.3.pdf.

All co-reference indices are shown on the syntactic node label,
including reference indices on the node labels for the empty
categories (as in all LDC English and Arabic Treebank releases).
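
As an illustration of how such labels can be read, the following is a
minimal sketch (not part of the release tooling) that splits a node
label into its category, function tags, and reference index.  The
function name split_label is hypothetical, and the sketch assumes that
an index is always a trailing hyphen-separated integer; other index
notations, such as gapping indices, are not handled.

```python
import re

# Labels such as -NONE-, -LRB- and -RRB- are atomic: no tags, no index.
_ATOMIC = re.compile(r"^-[A-Z]+-$")

def split_label(label):
    """Split a node label into (category, function_tags, index)."""
    if _ATOMIC.match(label):
        return label, [], None
    parts = label.split("-")
    index = None
    # A trailing all-digit part is taken to be the reference index.
    if len(parts) > 1 and parts[-1].isdigit():
        index = int(parts.pop())
    return parts[0], parts[1:], index

print(split_label("WHADVP-9"))    # ('WHADVP', [], 9)
print(split_label("ADVP-TMP-9"))  # ('ADVP', ['TMP'], 9)
print(split_label("NP-SBJ"))      # ('NP', ['SBJ'], None)
```

Two nodes that share an index, such as WHADVP-9 and ADVP-TMP-9 in the
sample tree in Section 6.3, are linked to each other.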

2.2 Annotation Process

The English text to be treebanked was extracted by script from files
of the same name that had undergone SU-level tokenization.

Higher level ASCII characters or non-ASCII characters in the text were
reduced to comparable common ASCII characters, as detailed in
docs/edit.list.  A listing of all of the changed characters in this
release is in docs/changed_chars.list.
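
Users who want to confirm that a piece of text contains only 7-bit
ASCII might use a check along the following lines.  This is a generic
sketch, not the script used to prepare the release; the function name
non_ascii_chars is hypothetical, and the sketch does not read or
depend on the format of docs/edit.list.

```python
def non_ascii_chars(text):
    """Return the sorted set of characters outside the 7-bit ASCII range."""
    return sorted({ch for ch in text if ord(ch) > 127})

print(non_ascii_chars("plain text"))        # []
print(non_ascii_chars("caf\u00e9 \u2019"))  # ['é', '’']
```

An empty result means that every character is plain ASCII.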

Word-level tokenization by script was manually corrected to be
consistent with tokenization guidelines developed for English
Treebanks in the GALE project. SU-level tokenization, however,
was inherited from the SU annotation and translation, and was not
adjusted during the treebank process (with the exception of multiple
sentences contained in a single SU, which were necessarily treebanked
as separate sentences).

The tokenized data was run through an automatic POS tagger, and the
tagger output was manually corrected to be consistent with the English
Treebank part-of-speech annotation guidelines in
https://repository.upenn.edu/cis_reports/570/
and addenda listed in Section 2.1.

The corrected POS annotated data was run through an automatic parser.

The parser output was manually corrected to be consistent with the
English Treebank syntactic annotation guidelines in
https://repository.upenn.edu/cis_reports/1021/
and addenda listed in Section 2.1.

The first QC process consists of a series of specific searches for
approximately 200 types of potential inconsistency and parser or
annotation error.  Any errors found in these searches were hand
corrected.

We also included an additional QC process that identifies repeated text
and structures, and flags non-matching annotations.  Annotation errors
found in this way have been manually corrected.  This error detection
system was a work in progress and was still being refined at the
time of this annotation.

The following papers describe this process:

Seth Kulick, Ann Bies, and Justin Mott
Using Derivation Trees for Treebank Error Detection
ACL 2011, Portland, Oregon, USA, June 19-24, 2011
http://papers.ldc.upenn.edu/ACL2011/DerivationTrees_TBErrorDetection.pdf

Seth Kulick and Ann Bies
A TAG-derived Database for Treebank Search and Parser Analysis
TAG+10: 10th International Workshop on Tree Adjoining Grammars and Related Formalisms, New Haven, CT, June 10-12, 2010
http://papers.ldc.upenn.edu/TAG2010/tag-paper-correct.pdf

Lead annotators for this project were Justin Mott and Colin Warner.
Additional annotators were John Laury, Jonathan Gress-Wright and
Arrick Lanfranchi.

3. Source Data Profile

3.1 Data Selection Process

Data was selected for this English treebank annotation in order to
maximize multiple annotations.

3.2 Data Sources and Epochs

The data consists of English source SMS/Chat text from various sources
collected by LDC.  The original collection of the source data
has been released as BOLT English SMS/Chat (LDC2018T19).

4. Annotated Data Profile

The data consists of 115,667 tokens/words in 484 files of English
source SMS/Chat genre text from various sources, all of which have
been annotated for word-level tokenization, part-of-speech, and
syntactic structure.

The source files in data/source_ascii-engtext consist of only the
English text that is being treebanked.  This text was extracted from
files of the same name from the SU annotation.  All annotation content is
contained within <en=#> metadata tags.  Any non-ASCII characters in
the original SU version of these files have been replaced with low
level ASCII characters according to docs/edit.list.  A listing of all
of the changed characters in this release is in
docs/changed_chars.list.

5. Directory Structure

A listing of all of the files in this release can be found in
docs/file.tbl.  A listing of the base data filenames can be found in
docs/file.ids.

The directory structure is as follows:

./docs -- documentation files for this release.
./data/source_ascii-engtext -- the English text extracted from the 
     SU files, with any higher-level ASCII characters or non-ASCII 
     characters changed into common ASCII characters (details 
     in docs/edit.list)
./data/ag_xml -- the annotation files in AG format, including all 
     POS and treebank annotation as well as any comments from the 
     annotators
./data/penntree -- the annotation files in Penn Treebank bracketed 
     list style
./dtds -- dtds for the included xml

6. File Format Description

6.1 *.flat (in data/source_ascii-engtext)

SU level tokens are separated by line breaks, and introduced by <en=#>
delimiters.

Only round brackets, "(" and ")", are converted to -LRB-/-RRB- form,
and any html character codes have been converted to their
corresponding characters.

Bracket representation is as follows:

() are represented as -LRB- and -RRB- in the .tree files (this
includes emoticons like :--RRB- )

[] {} <> and all other brackets are unchanged.
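
The bracket mapping described above can be sketched as follows.  This
is an illustration only, not the conversion script used for the
release, and the function names are hypothetical.

```python
def escape_round_brackets(token):
    """Replace ( and ) with -LRB- and -RRB-; other brackets are untouched."""
    return token.replace("(", "-LRB-").replace(")", "-RRB-")

def unescape_round_brackets(token):
    """Reverse the mapping, restoring literal round brackets."""
    return token.replace("-LRB-", "(").replace("-RRB-", ")")

print(escape_round_brackets(":-)"))         # :--RRB-
print(escape_round_brackets("[ok]"))        # [ok]
print(unescape_round_brackets(":--RRB-"))   # :-)
```

The escaped form keeps literal parentheses in the text from being
confused with the parentheses that mark tree structure.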

Any higher-level ASCII characters or non-ASCII characters from the SU
files are changed to common ASCII characters (details in
docs/edit.list).  A listing of all of the changed characters in this
release is in docs/changed_chars.list.

6.2 *.xml (in data/ag_xml/)

TreeEditor .xml stand-off annotation files.  These files contain only
the POS and Treebank annotations and reference the source files in
data/source_ascii-engtext by character offset.

6.3 *.tree (in data/penntree/)

Bracketed tree files following the basic form (NODE (TAG token)).  Each
SU is surrounded by a pair of unlabeled parentheses.  Sample:

( (S (NP-SBJ (PRP I)) (VP (MD 'll) (VP (VB post) (NP (NP (NNS highlights)) (PP (IN from) (NP (DT the) (NN opinion) (CC and) (NNS dissents)))) (SBAR-TMP (WHADVP-9 (WRB when)) (S (NP-SBJ (PRP I)) (VP (VBP 'm) (ADJP-PRD (JJ finished)) (ADVP-TMP-9 (-NONE- *T*))))))) (. .)) )

Usually this means that a pair of unlabeled parentheses surrounds each
sentence, but for SUs containing more than one sentence, the pair of
unlabeled parentheses will contain all necessary top level sentences.
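
A minimal sketch of reading this format is shown below.  It is not
part of the release; the names parse_su and leaves are hypothetical,
and the input is a shortened fragment adapted from the sample above.
The sketch assumes well-formed input containing exactly one SU.

```python
import re

# A token is an opening bracket, a closing bracket, or a run of other
# non-space characters (labels, tags, and words).
TOKEN = re.compile(r"\(|\)|[^\s()]+")

def parse_su(text):
    """Parse one bracketed SU into nested lists: [label, child, ...]."""
    tokens = TOKEN.findall(text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # consume "("
        label = ""  # the outer wrapper has no label
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # terminal token
                pos += 1
        pos += 1  # consume ")"
        return [label] + children

    return node()

def leaves(tree):
    """Yield (tag, token) pairs for the preterminals, left to right."""
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        yield label, children[0]
    else:
        for child in children:
            yield from leaves(child)

sample = ("( (SBAR-TMP (WHADVP-9 (WRB when)) (S (NP-SBJ (PRP I)) "
          "(VP (VBP 'm) (ADJP-PRD (JJ finished)) "
          "(ADVP-TMP-9 (-NONE- *T*))))) )")

tree = parse_su(sample)
# Skip empty categories, which are tagged -NONE-.
words = [tok for tag, tok in leaves(tree) if tag != "-NONE-"]
print(words)  # ['when', 'I', "'m", 'finished']
```

Filtering on the -NONE- tag recovers the surface tokens, since empty
categories such as *T* do not correspond to any source text.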

7. Data Validation

automatic tokenization => human correction of tokenization =>
automatic pre-tag => human correction and annotation of Part-of-Speech
=> automatic pre-parse => human correction and annotation of syntactic
structure => QC correction, search-based => additional QC correction,
KBM-based

The first QC process consists of a series of specific searches for
approximately 200 types of potential inconsistency and parser or
annotation error.  Any errors found in these searches were hand
corrected.

We have added an additional QC process that identifies repeated text
and structures, and flags non-matching annotations.  Annotation errors
found in this way have been manually corrected.  See Section 2.2 above
for references to papers that describe this process.

8. DTDs

The DTD files for the AG are kept in
data/translation-alternates-included/ag_xml, as well as in the dtds/
directory.

9. Copyright Information

Portions (c) 2012-2021 Trustees of the University of Pennsylvania

10. Contact Information

Contact info for key project personnel:
Ann Bies, Senior Research Coordinator, Linguistic Data Consortium,
bies@ldc.upenn.edu

11. Update Log

This index was updated on April 10, 2019 by Seth Kulick