BOLT Egyptian Arabic PropBank -- DF, SMS/Chat, and CTS

             University of Colorado
                    
Authors: Martha Palmer, Jena D. Hwang, Aous Mansouri, Claire Bonial, 
         Tim O'Gorman, James Gung

1. Introduction

 The DARPA BOLT Program created new techniques for automated translation
 and linguistic analysis that can be applied to informal genres of text
 and speech common to online and in-person communications. The BOLT data
 team, led by the Linguistic Data Consortium, was responsible for collecting
 informal data sources, including discussion forums, text messaging and
 chat, and conversational telephone speech in English, Chinese, and Egyptian
 Arabic, and for applying annotations including translation, word alignment,
 Treebank, PropBank, co-reference, and queries/responses.

 This corpus contains the data produced for the PropBank annotation task led
 by the University of Colorado. PropBank annotation provides a layer of semantic
 annotation on top of the phrase structure of Treebank. Each predicate type in 
 a tree is annotated in terms of its sense and semantic roles. This annotation 
 aims to provide consistent semantic role labels across different syntactic 
 realizations of the same predicate and to assign functional tags to all 
 non-core arguments of the predicate.   

 PropBank annotation for this release was performed on top of BOLT Treebank
 annotation. The tokens produced by Treebank annotation are used directly
 for PropBank annotation.

 The data covers three genres: DF (Discussion Forums), SMS/Chat, and CTS
 (Conversational Telephone Speech).

 Annotation data in this release were originally released in the following
 e-corpora: 
 
 BOLT Phase 1 Egyptian Arabic Propbank DF Part 1 V3.0 (LDC2012E122)
 BOLT Phase 1 Egyptian Arabic Propbank DF Part 2 V3.0 (LDC2012E129)
 BOLT Phase 1 Egyptian Arabic Propbank DF Part 3 V3.0 (LDC2013E22)
 BOLT Phase 1 Egyptian Arabic Propbank DF Part 4 V3.0 (LDC2013E23)
 BOLT Phase 1 Egyptian Arabic Propbank DF Part 5 V3.0 (LDC2013E24)
 BOLT Phase 1 Egyptian Arabic Propbank DF Part 6 V2.0 (LDC2013E72)
 BOLT Phase 1 Egyptian Arabic Propbank DF Part 7 V2.0 (LDC2013E73)
 BOLT Phase 1 Egyptian Arabic Propbank DF Part 8 (LDC2013E108)
 BOLT Phase 2 Egyptian Arabic Propbank SMS/Chat Part 1 (LDC2014E32)
 BOLT Phase 2 Egyptian Arabic Propbank SMS/Chat Part 2 (LDC2014E62)
 BOLT Phase 2 Egyptian Arabic Propbank SMS/Chat Part 3 (LDC2014E98)
 BOLT Phase 2 Egyptian Arabic Propbank SMS/Chat Part 4 (LDC2014E118)
 BOLT Phase 2 Egyptian Arabic Propbank SMS/Chat Part 5 (LDC2014E128)
 BOLT Phase 3 Egyptian Arabic Propbank CTS Part 1 v2 (LDC2015E32)
 BOLT Phase 3 Egyptian Arabic Propbank CTS Part 2 (LDC2015E55)
  
2. Source and Annotation Data

2.1 Source Data

 DF source data was manually harvested online by native speakers and 
 subsequently triaged down to a selected portion and sentence-segmented 
 for translation and annotation. SMS and chat source data were collected via
 live collection platforms and donations. CTS source data was originally
 collected for the Arabic and Chinese CallHome and CallFriend programs, where
 the collected audio source files were first transcribed and then translated 
 by professional transcription/translation agencies. 

 Source data used for PropBank annotation are tokens from BOLT Egyptian 
 Arabic Treebank data, which were originally released in the following 
 e-corpora: 

 Arabic Treebank ARZ Part 1 (LDC2012E28)
 BOLT Phase 1 Egyptian Arabic Treebank DF Part 1 (LDC2012E93)
 BOLT Phase 1 Egyptian Arabic Treebank DF Part 2 (LDC2012E98)
 BOLT Phase 1 Egyptian Arabic Treebank DF Part 3 (LDC2012E89)
 BOLT Phase 1 Egyptian Arabic Treebank DF Part 4 (LDC2012E99)
 BOLT Phase 1 Egyptian Arabic Treebank DF Part 5 (LDC2012E107)
 BOLT Phase 1 Egyptian Arabic Treebank DF Part 6 (LDC2012E125)
 BOLT Phase 1 Egyptian Arabic Treebank DF Part 7 (LDC2013E12)
 BOLT Phase 1 Egyptian Arabic Treebank DF Part 8 (LDC2013E21)
 BOLT Phase 2 Egyptian Arabic Treebank SMS/Chat Part 1 (LDC2013E120)
 BOLT Phase 2 Egyptian Arabic Treebank SMS/Chat Part 2 (LDC2013E133)
 BOLT Phase 2 Egyptian Arabic Treebank SMS/Chat Part 3 (LDC2014E17)
 BOLT Phase 2 Egyptian Arabic Treebank SMS/Chat Part 4 (LDC2014E43)
 BOLT Phase 2 Egyptian Arabic Treebank SMS/Chat Part 5 (LDC2014E63)
 BOLT Phase 3 Egyptian Arabic Treebank CTS Part 1 V2.0 (LDC2014E120)
 BOLT Phase 3 Egyptian Arabic Treebank CTS Part 2 V2.0 (LDC2015E04)
 
2.2 Annotation Data Profile

Language  Genre     .propFiles  FrameFiles  Rolesets  Predicates  SourceTokens
------------------------------------------------------------------------------
Egyptian  SMS/Chat        1083         n/a       n/a       31397        198007
Egyptian  DF               730         n/a       n/a       59127        400448
Egyptian  CTS              112         n/a       n/a       15763         99201
------------------------------------------------------------------------------
Total                     1925        9626     12866      106287        697656

Note: SourceTokens = tree tokens
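
The totals row can be cross-checked against the per-genre rows. The short
Python sketch below is illustrative only: the variable and function names are
hypothetical, and the figures are copied from the table above.

```python
# Per-genre counts copied from the table above:
# (.prop files, predicates, source tokens).
counts = {
    "SMS/Chat": (1083, 31397, 198007),
    "DF": (730, 59127, 400448),
    "CTS": (112, 15763, 99201),
}


def column_totals(rows):
    """Sum each column across all genres."""
    return tuple(sum(column) for column in zip(*rows.values()))


totals = column_totals(counts)
print(totals)
```

The three sums should agree with the .prop file, predicate, and source token
figures in the Total row.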

  
3. Annotation

3.1 Annotation Guidelines

The annotation guidelines are included in this package, and can be
found at docs/APB-Annotation-Guidelines.pdf.  The guidelines were largely
developed under the OntoNotes effort, which was part of the DARPA GALE 
program. They were extended as part of the BOLT program effort to better 
cover the new data genres and Egyptian dialect. 

The annotation data is stored in data/.

PropBank annotation is supported with the Jubilee interface implemented 
by the University of Colorado, where any node in the tree can be selected 
and assigned tags.

3.2 Annotator Training

New annotators are trained on a set of trial data and work under careful
supervision until they reach an adequate level of consistency, at which point
they start production-level annotation. Additionally, all annotators are
required to attend a bi-weekly meeting to discuss questions and issues
encountered during annotation. These meetings also support new annotators and
serve as a refresher course for seasoned annotators.

3.3 Annotation Stages

For PropBank annotation, predicate argument structure annotation is carried
out in two phases. In the first phase, a frame file for a predicate is created
by examining all instances of the predicate in the Treebank data and
distinguishing its senses, which are called Framesets or Rolesets.

In the second phase, the predicate argument structure of all instances of
the predicate is annotated, using the frame file as a reference. The
arguments of each predicate receive an argument label in the form of ArgN, 
where N is an integer between 0 and 6. These numbered arguments represent 
core arguments that are defined in relation to the predicate. Each core 
argument plays a unique role with regard to the predicate. Core arguments 
are as consistent as possible with respect to thematic roles. Arg0 is used 
for the most agentive role a given predicate can take. Arg1 is used for the 
proto-patient, or most patient-like argument. Arg2 is most often used to 
mark a beneficiary, Arg3 is most often used to show a start point, and Arg4 
is most often used for the end point. Args2-4 are less consistent, as not 
all verbs with more than 2 core roles require a start/end point role or a 
beneficiary, so these are used in other ways as dictated by a given predicate.
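
As a rough illustration of the labeling scheme described above, the following
Python sketch maps a numbered argument label to its typical interpretation.
The names TYPICAL_ROLES and describe_core_arg are hypothetical, and the
"ArgN" label spelling is taken from this section rather than from the .prop
file format. The frame file for each predicate remains the authoritative
definition of its roles.

```python
import re

# Typical interpretations summarized in this section. Arg2 to Arg4 are
# tendencies only; higher-numbered arguments are predicate-specific.
TYPICAL_ROLES = {
    0: "most agentive role",
    1: "proto-patient (most patient-like argument)",
    2: "often a beneficiary",
    3: "often a start point",
    4: "often an end point",
}

# ArgN, where N is an integer between 0 and 6.
_CORE_ARG = re.compile(r"^arg([0-6])$", re.IGNORECASE)


def describe_core_arg(label):
    """Return the typical interpretation of a core argument label."""
    match = _CORE_ARG.match(label.strip())
    if match is None:
        raise ValueError(f"not a core argument label: {label!r}")
    number = int(match.group(1))
    return TYPICAL_ROLES.get(number, "predicate-specific; see the frame file")


print(describe_core_arg("Arg1"))
```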

3.4 Annotation Quality Control

All of the annotations are the result of double-blind annotation followed by
adjudication of disagreements.

4. Data Structure and File Format

4.1 .prop Files

The proposition format is described in docs/APB-data-format.txt.

4.2 Frame Files

The frame files are in XML format. The definition is included in
dtds/verb.dtd.

ARZ frame files can be distinguished from the MSA frame files by the topmost 
comment "EGYPTIAN ARABIC".
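
A minimal Python sketch of this check is given below. It assumes that the
"topmost comment" is the first XML comment in the file and that it contains
the literal text EGYPTIAN ARABIC; the function names are hypothetical.

```python
import re
from pathlib import Path

# Matches the first XML comment in a document.
_FIRST_COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)


def is_arz_frame_text(xml_text):
    """True if the first XML comment contains 'EGYPTIAN ARABIC'."""
    match = _FIRST_COMMENT.search(xml_text)
    return match is not None and "EGYPTIAN ARABIC" in match.group(1)


def is_arz_frame_file(path):
    """Apply the same check to a frame file on disk (assumed UTF-8)."""
    return is_arz_frame_text(Path(path).read_text(encoding="utf-8"))
```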

4.3 Using Multi-layer Annotation Data 

PropBank annotations make use of corresponding .tree files for each document, 
and the annotations use the sentence divisions and tokenizations from those 
tree files. The way that the .prop files relate to the .tree files is detailed 
in the .prop file description at docs/APB-data-format.txt.

PropBank annotations in this package can be used together with other types
of BOLT annotation data, because the same source tokens are annotated at
multiple levels, including treebank, word alignment, and co-reference
annotation. Tokens are numbered in the same way, and identical
filebase/filestem names are used across annotations. Each type of annotation
adds its own file extension, so users can find other types of annotation
for a document by looking for the same filestem name.
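
The filestem lookup can be sketched in Python as follows. This is an
illustration under stated assumptions, not part of the release: the other
annotation types come from separate packages whose directories the user
supplies, a companion file is assumed to be named with the filestem followed
by a dot and its own extension, and the function name companion_files is
hypothetical.

```python
from pathlib import Path


def companion_files(prop_path, other_roots):
    """Find files under other_roots that share the filestem of a .prop file."""
    prop_path = Path(prop_path)
    name = prop_path.name
    # The filestem is the file name without the trailing ".prop".
    stem = name[: -len(".prop")] if name.endswith(".prop") else prop_path.stem
    prefix = stem + "."
    matches = []
    for root in other_roots:
        for candidate in sorted(Path(root).rglob("*")):
            if candidate.is_file() and candidate.name.startswith(prefix):
                matches.append(candidate)
    return matches
```

Matching on the filestem plus a dot avoids confusing a document with another
whose name merely begins with the same characters.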

5. Package Directory Structure

--docs
  --README.txt
  --APB-Annotation-Guidelines.pdf  
  --APB-data-format.txt  
  --filelist.txt
--data
  --annotation
    --cts/{01,02}/*.prop
    --df/{01,02,03,04,05,06,07,08}/*.prop
    --sms_chat/{01,02,03,04,05}/*.prop
  --metadata
    --frames/*.xml
--dtds
  --verb.dtd
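
Given this layout, the .prop files for each genre can be enumerated with a
short Python sketch such as the one below. The function name and the
package_root argument are hypothetical; the directory names are taken from
the listing above.

```python
from pathlib import Path

# Genre directories under data/annotation, as listed above.
GENRE_DIRS = ("cts", "df", "sms_chat")


def count_prop_files(package_root):
    """Count .prop files per genre under data/annotation/<genre>/<part>/."""
    annotation = Path(package_root) / "data" / "annotation"
    return {
        genre: sum(1 for _ in (annotation / genre).glob("*/*.prop"))
        for genre in GENRE_DIRS
    }
```

Run against the root of this package, the per-genre counts can be compared
with the .prop file column in Section 2.2.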

6. Documentation

 -docs/README.txt: this file.
 -docs/APB-data-format.txt: this document explains the data format of the
  Egyptian Arabic PropBank annotation
 -docs/APB-Annotation-Guidelines.pdf: Egyptian Arabic PropBank annotation
  guidelines
 -docs/filelist.txt: the list of files showing the directory structure of
  this package
 -dtds/verb.dtd: this document specifies the frame file format.

7. Data Validation and Sanity Check

 - Validate XML files against the DTD (in dtds/)
 - Verify filename stems consistent with tree filename stems
 - Verify encoding as UTF-8
 - XML frame files were validated against docs/frameset.dtd and were
   checked for frame-internal consistency (e.g., misspellings, extraneous
   characters, general correctness).
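
 Users who want to repeat part of these checks on their own copy could start
 from a Python sketch like the one below. It is not the procedure used to
 prepare this release, and the function names are hypothetical. It covers
 only the UTF-8 check and XML well-formedness; validation against a DTD is
 not attempted here, because the Python standard library does not provide it.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def is_utf8(path):
    """True if the file's bytes decode as strict UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def is_well_formed_xml(path):
    """True if the file parses as XML (well-formedness only, no DTD)."""
    try:
        ET.parse(str(path))
        return True
    except ET.ParseError:
        return False
```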

8. Acknowledgements

 This material is based upon work supported by the Defense Advanced Research
 Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does
 not necessarily reflect the position or the policy of the Government, and no
 official endorsement should be inferred.

 Annotation contributors: James Babani, Yahya Aseri, Maha Foster, trainers and 
 adjudicators
 
 Stephanie Strassel, Xuansong Li, and Stephen Grimes from LDC contributed
 to the PropBank data by drafting documentation, sanity-checking data, specifying
 the data format, and streamlining the data release process.

9. Copyright Info

(c) 2012, 2013, 2014, 2015, 2016, 2017 Trustees of the University of Pennsylvania.

10. Contact Information

If you have questions about this data release, please contact the following 
personnel:

Martha Palmer <martha.palmer@colorado.edu>
Stephanie Strassel <strassel@ldc.upenn.edu>
Xuansong Li <xuansong@ldc.upenn.edu>

--------------------------------------------------------------------------
README Created December 21, 2016 by Xuansong Li and Tim O'Gorman