REFLEX Entity Translation Training/DevTest

			      LDC2009T07

		      Linguistic Data Consortium

1. Introduction

  This corpus constitutes the complete set of training data and
  development test data for the 2007 REFLEX Entity Translation
  evaluation.  The total set of Training/DevTest data constitutes
  approximately 67.5k words for each of three languages: English,
  Chinese and Arabic. The data set is made up of 22.5k words of
  English data, 22.5k words of Chinese data, and 22.5k words of Arabic
  data translated into each of the other two languages.

  The "timex2norm" directories contain the "official" (i.e. final)
  versions of the annotation files (.apf.xml) and the source text
  files (.sgm).  All files in the "timex2norm" directories have
  undergone all stages of the annotation pipeline, including all batch
  QC processes.  Users of this data release should train or test with
  these files, not the files under the "1p" and "2p" directories --
  which reflect "incomplete" data at various stages in the annotation
  pipeline.

  The apf.xml files included in this release use the APF DTD version
  5.1.2, which was posted on the NIST ACE website on 12/18/2006.

  http://www.nist.gov/speech/tests/ace/ace07/doc/index.htm

  The only difference between version 5.1.1 and version 5.1.2 is that
  NOMPOST and NAMPOST have been added to the LDCTYPE attribute value
  list of entity_mention in version 5.1.2.  These values are used only
  for Spanish, which is not included in this release.

2. Segment Alignment

  Segments are sentence-like units that are semi-automatically
  identified prior to translation.  Translators are instructed to
  translate one segment at a time, but always within the context of
  the entire source document.  This means that in most cases, the
  entities mentioned within a given segment for the source document
  will also be mentioned in that same segment for the translations of
  that document.  This is not always the case, due to the occasional
  translation error, or to valid alternative syntactic or lexical
  choices by the translator that result in imperfect segment-entity
  alignment.

  This package contains two types of information for segment
  alignment.  Segment Alignment Text (.seg) files contain lists of
  segments with their apf start- and end- character offsets for each
  language.  Additionally, within the doc directory is the file
  allEntities.tdf, intended for viewing by humans.  This is a
  tab-delimited table that lists every entity mentioned in a given
  segment for a document, across all three languages.  The table is
  sorted by Document ID then by segment and includes EntityID, Type,
  Subtype, Head, Level and Class for every entity/language.  Note that
  the entities themselves are not aligned or mapped in this table.

  Segment alignment information is not available for all documents.
  For some documents, different segmentations were used for each of
  the two translations that were completed. In other documents,
  substantial formatting or encoding changes were introduced during
  the translation process. In the following documents, the two
  versions of the translation file were disparate enough that the
  segments have not been aligned:

  - ALFILFILM_20050203.1756
  - ALFILFILM_20050205.0832
  - APW_ENG_20030424.0698
  - APW_ENG_20030527.0232
  - DIGRESSING_20041101.1921
  - DIGRESSING_20041107.0106
  - DIGRESSING_20041206.0246
  - DIGRESSING_20041213.0439
  - DIGRESSING_20050107.0236
  - DIGRESSING_20050115.0958
  - DIGRESSING_20050121.2236
  - DIGRESSING_20050122.0117
  - DIGRESSING_20050123.1900
  - DIGRESSING_20050201.1820
  - EGYDAYS_20050221.1227
  - FLOPPINGACES_20041113.1528.042
  - MARKETVIEW_20050127.0716
  - chtb_254

3. Annotation

3.1 Tasks and Guidelines

  Data contained in this release has been annotated for the following tasks:

    - Entities
    - TIMEX2 extents & normalization

  The annotation guidelines used for this corpus were essentially the
  same as the ACE 2005 guidelines.

  The ACE 2005 annotation guidelines for each language can be
  downloaded from LDC's ACE website:

  http://projects.ldc.upenn.edu/ace/

  The annotation guidelines used for TIMEX2 annotation can be found
  here:

  http://projects.ldc.upenn.edu/ace/docs/English-TIMEX2-Guidelines_v0.1.pdf

3.2 Annotation Process

  Training/DevTest data are annotated for all tasks by one annotator
  and then second-pass annotated by a senior annotator or team
  leader. The first pass (complete) annotation is called 1P. The
  second pass (complete) annotation is called 2P. For 1P, a single
  junior annotator completes all tasks (entities and TIMEX2 extents)
  for a file. For 2P, a more experienced senior annotator reviews the
  first-pass annotations and corrects any errors they identify. Then,
  TIMEX2 values are normalized by an annotator who was specifically
  trained for the task.  This task is known as NORM.  After NORM,
  additional corpus-wide quality control (QC) spot-checks are
  conducted on normalized data by the team leader and selected senior
  annotators.

  The full annotation process for REFLEX Entity Translation
  Training/DevTest data is represented below:

              1P: entities
                  TIMEX2 extents
                  |
                  V
              2P: entities
                  TIMEX2 extents
                  |
                  V
            NORM: TIMEX2 normalization 
                  |
                  V
              QC: entities
                  TIMEX2 extents
                  TIMEX2 normalization 

4. Source Data Profile

  Below is a description of the data sources and epochs for the
  Training/DevTest data set.

  - REFLEX Entity Translation Training/DevTest

    * Sources: ACE 04, ACE 05 training pools

      - Newswire (NW): 

          AFP_ENG (Agence France-Presse - English) - 2003.03-2003.06 
          AFA     (Agence France-Presse - Arabic)  - 2000.10-2000.12
          ALH     (Al Hayat - Arabic)              - 2000.10-2000.12
          ANN     (An Nahar - Arabic)              - 2000.10-2000.12
          APW_ENG (Associated Press - English)     - 2003.03-2003.06 
          XIN     (Xinhua News Agency - Chinese)   - 2000.10-2000.12
          XIN_ENG (Xinhua News Agency - English)   - 2003.03-2003.06
          ZBN     (Zaobao News Agency - Chinese)   - 2000.10-2000.12

          Note: Files taken from ACE04 have three-letter source IDs
                (e.g., AFA).  Files taken from ACE05 use a newer
                convention: a three-letter source ID, an underscore
                and a three-letter language ID (e.g., APW_ENG).

      - Weblog (WL):

          Various sources                          - 2004.11-2005.02

      - Arabic Treebank (ATB):

          ANN (An Nahar)                           - 2002.01-2002.05

      - Chinese Treebank (CTB):

          Xinhua News Agency                       - 1994.09-1998.01

    * 3-way translation

      - 22.5 Kw English -> Arabic, Chinese
      - 22.5 Kw Chinese -> Arabic, English
      - 22.5 Kw Arabic  -> Chinese, English

      - Untranslatable text: In some cases, a word or phrase was
        deemed "untranslatable" by the professional translation
        agencies. In cases where a word or phrase cannot be
        translated into a target language, an empty
        "<UNTRANSLATEDTEXT/>" tag has been included in the 
        translation text file.

    * Total of approximately 67,500 words/language

      The word counts vary by language since translated files often
      have larger or smaller word counts than the files in the source
      language.  In particular, English files appear to have larger
      word counts than the corresponding Arabic files.

5. Annotation Data Profile

  Below is information about the amount of data included in the
  current release and its annotation status.

    - 1P: data subject to first pass (complete) annotation
    - 2P: data subject to second pass (complete) annotation
    - NORM: data also subject to TIMEX2 normalization

  Note: Chinese data are expressed in terms of characters.  We assume
  a correspondence of roughly 1.5 characters/word.
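
  As an illustration of this conversion, the following Python sketch
  turns a Chinese character count into an approximate word count using
  the 1.5 characters/word figure.  The function name is hypothetical
  and is not part of this release.

```python
CHARS_PER_WORD = 1.5  # rough correspondence stated in this README


def approx_chinese_words(char_count: int) -> float:
    """Estimate a Chinese word count from a character count."""
    return char_count / CHARS_PER_WORD


# Chinese-source "Chn Total" from the tables in this section, in characters.
print(round(approx_chinese_words(33735)))
```

  For the Chinese-source set, 33735 characters correspond to roughly
  22,490 words, in line with the 22.5 Kw figure given in section 4.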

  - Arabic-source

                 =========words==========      ==========files======
                   1P       2P     NORM         1P      2P    NORM
    Arabic ATB    4455     4455     4455         9       9      9
    Chinese ATB   9680     9680     9680         9       9      9
    English ATB   6233     6233     6233         9       9      9
    Arabic NW     7600     7600     7600        48      48     48
    Chinese NW   16657    16657    16657        48      48     48
    English NW    9954     9954     9954        48      48     48
    Arabic WL    10638    10638    10638        38      38     38
    Chinese WL   23312    23312    23312        38      38     38
    English WL   14563    14563    14563        38      38     38
                -------------------------      ---------------------
    Arb Total    22693    22693    22693        95      95     95
    Chn Total    49649    49649    49649        95      95     95
    Eng Total    30750    30750    30750        95      95     95


  - Chinese-source

                 =========words==========      ==========files======
                   1P       2P      NORM        1P      2P    NORM
    Arabic CTB    2505     2505     2505        15      15     15
    Chinese CTB   4782     4782     4782        15      15     15
    English CTB   3244     3244     3244        15      15     15
    Arabic NW     7465     7465     7465        22      22     22
    Chinese NW   13589    13589    13589        22      22     22
    English NW    8980     8980     8980        22      22     22
    Arabic WL     8847     8847     8847        19      19     19
    Chinese WL   15364    15364    15364        19      19     19
    English WL   10761    10761    10761        19      19     19
                -------------------------      ---------------------
    Arb Total    18817    18817    18817        56      56     56
    Chn Total    33735    33735    33735        56      56     56
    Eng Total    22985    22985    22985        56      56     56


  - English-source
    
                 =========words==========      ==========files======
                   1P       2P     NORM         1P      2P    NORM
    Arabic NW    15106    15106    15106        34      34     34
    Chinese NW   31721    31721    31721        34      34     34
    English NW   17104    17104    17104        34      34     34
    Arabic WL     8918     8918     8918        29      29     29
    Chinese WL   18953    18953    18953        29      29     29
    English WL   10302    10302    10302        29      29     29
                -------------------------      ---------------------
    Arb Total    24024    24024    24024        63      63     63
    Chn Total    50674    50674    50674        63      63     63
    Eng Total    27406    27406    27406        63      63     63

6. Data Directory Structure

  The data are organized by source language: 
     data/arabic-source/
     data/chinese-source/
     data/english-source/

  Then by data type:
     atb/
     ctb/
     nw/
     wl/

  Then by annotation status:
     
   - 1p: data subject to first pass (complete) annotation

   - 2p: data subject to second pass (complete) annotation

   - timex2norm: data also subject to TIMEX2 normalization, plus
     additional QC.

  So for instance, if a Chinese source file has been fully annotated,
  you will find an .apf.xml annotation file of the source Chinese and
  the English and Arabic translations in each of "1p", "2p" and
  "timex2norm". Here is an example:
  	
	Chinese source:
	data/chinese-source/nw/timex2norm/XIN20001218.2000.0158.apf.xml

	English translation:
	data/chinese-source/nw/timex2norm/XIN20001218.2000.0158.eng.apf.xml

	Arabic translation:
	data/chinese-source/nw/timex2norm/XIN20001218.2000.0158.arb.apf.xml
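
  The layout above can be sketched in Python as follows.  This is a
  minimal illustration: the function name and its parameters are
  hypothetical, and the only translation suffixes attested in this
  README are "eng" and "arb", so any other suffix passed in is the
  caller's assumption.

```python
from pathlib import PurePosixPath


def apf_path(source_lang, data_type, status, doc_id, translation=None):
    """Build the relative path of an .apf.xml file.

    translation is the language suffix of a translated file (for example
    "eng" or "arb"), or None for the source-language file.
    """
    name = doc_id if translation is None else f"{doc_id}.{translation}"
    return PurePosixPath("data", f"{source_lang}-source", data_type, status,
                         f"{name}.apf.xml")


print(apf_path("chinese", "nw", "timex2norm", "XIN20001218.2000.0158", "eng"))
```

  Passing "timex2norm" as the status selects the final versions of the
  files, which are the ones recommended for training and testing.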

  The "FileList" files contain information about the word (for English
  and Arabic) or character (for Chinese) counts and annotation status
  for each file in the release. 

  The "doc" directory contains segment-aligned entity mentions from
  each language.

  Entity Mention Table (.tdf) File: doc/allEntities.tdf

        This report contains all mentions from all languages for all
        documents in a single tab-delimited text file with the
        following fields:

        - docid
        - segment number
        - English entity id
        - English entity type
        - English entity subtype
        - English mention head
        - English mention type (level)
        - English entity class
        - Chinese entity id
        - Chinese entity type
        - Chinese entity subtype
        - Chinese mention head
        - Chinese mention type (level)
        - Chinese entity class
        - Arabic entity id
        - Arabic entity type
        - Arabic entity subtype
        - Arabic mention head
        - Arabic mention type (level)
        - Arabic entity class

       No attempt has been made to align entity mentions between
       languages. In other words, the English entity mentions for a
       segment are listed in the order that they appear in the English
       text, the Chinese mentions in the order they appear in the
       Chinese text, and the Arabic mentions in the order they appear
       in the Arabic text. Entity mentions on the same line of the
       .tdf file may or may not correspond to mentions of the same
       entity across languages.
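
       A table with these twenty fields could be read along the
       following lines.  This is a sketch under assumptions: the
       column names are invented for illustration, the file is
       assumed to have no header row, and the sample row is made up
       rather than taken from the corpus.

```python
import csv
import io

LANGS = ("eng", "chn", "arb")  # column order given in the field list
PER_LANG = ("entity_id", "entity_type", "entity_subtype",
            "mention_head", "mention_level", "entity_class")
FIELDS = ["docid", "segment"] + [f"{l}_{f}" for l in LANGS for f in PER_LANG]


def read_entity_table(handle):
    """Yield one dict per row of a tab-delimited entity mention table."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Pad short rows so that missing trailing cells become empty strings.
        row = row + [""] * (len(FIELDS) - len(row))
        yield dict(zip(FIELDS, row))


# Invented sample row, for illustration only (not taken from the corpus).
sample = "DOC1\t1\tE1\tPER\tIndividual\tSmith\tNAM\tSPC" + "\t" * 12 + "\n"
rows = list(read_entity_table(io.StringIO(sample)))
print(rows[0]["eng_mention_head"])
```

       Because mentions are not aligned across languages, the three
       groups of columns in a row should be treated as independent
       lists rather than as a mapping between entities.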
  
7. File Format Description

   Each directory contains files of the following formats.  For most
   users, the most important files are the .sgm files and .apf.xml
   files.

   Source Text (.sgm) Files

      - These files contain the source text in SGML format.  They use
        UNIX-style line endings.  All .sgm files are encoded in UTF-8.

   ACE Program Format (APF) (.apf.xml) Files

      - These files are in the official ACE annotation file format.  See
        section 9 for more details.

   AG (.ag.xml) Files

      - These are annotation files created with the LDC's annotation
        toolkit.  These files have been converted to the corresponding
        .apf.xml files.

   ID table (.tab) Files

      - These files store mapping tables between the IDs used in the
        ag.xml files and their corresponding apf.xml files.

8. Data Validation

  Below is a description of the sanity checks and other format
  validation steps applied to annotation files created by LDC. 

  Checks included in the annotation tool or applied automatically:

    -- Extents stripped of all spaces and punctuation at front and back
    -- GPE mentions without roles were fixed
    -- For non-GPE mentions with roles, roles were removed
    -- All non-complex entity mentions have heads.  For APF, this means
       that all entity mentions have heads
    -- All NAMPRE and NOMPRE GPE mentions have GPE as their role
    -- All files have exactly one timex2 annotation in the DATETIME field
    -- No annotation extents overlap without nesting (entity mention, 
       entity mention head, timex2 mention)
    -- There are no annotations inside of sgm tags
    -- All entities have permissible type-subtype pairs
    -- All files successfully convert to APF
    -- All APF files validate against DTD
    -- All APF files can be scored against themselves
    -- Search for untagged pronouns (English, Arabic)
    -- Search for English Building-Grounds mentions containing "Airport" or
       "Airfield"
    -- Search for untagged relative clauses (English)
    -- Search for demonstratives tagged as WHQ (Arabic)
    -- Scan all unannotated common TIMEX2 and value triggers (English)
    -- Check that all POSTDATEs are annotated
    -- Check for missing SPEAKER annotations
    -- Check for missing POSTER annotations

  Checks applied after annotation as additional QC:

    -- No English passages are annotated in non-English files
    -- All instances of cross-type metonymy manually reviewed
    -- All instances of co-extensive entity mentions with the same heads
       manually reviewed
    -- Manual scan of all NOM heads with different entity type/subtype
       values in different parts of the corpus (normalized files only)
    -- Manual scan of all NAM heads with different entity type/subtype
       values in different parts of the corpus (normalized files only)
    -- Manual scan of all entity mention heads by entity type/subtype for
       outliers in normalized files
    -- Manually examine and correct or describe all fatal errors and warnings
       generated by the most recent version of the scorer
    -- Manually review cases where a SPEAKER or POSTER annotation is not
       coreferenced with an entity mention outside of SPEAKER or
       POSTER tags
    -- Manually review timex2 normalization values

9. Notes About APF

   - Offsets

     APF uses the offset counting method traditionally used in previous
     ACE evaluation programs: 

       1) Each (UTF-8) character, not byte, is counted as one.  

       2) Each newline character is counted as one.  (The .sgm files
          use the UNIX-style end of line characters.)

       3) SGML tags are *not* counted towards offsets.  (Please note
          that the AG files included in this release do count SGML tags in
          offsets.)

       4) SGML entities are counted in terms of each character in the
          entities.  For example, "&amp;" is counted as five
          characters, not as one character.
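
     The counting rules above can be sketched in Python as follows.
     This is a hedged illustration, not the tool used by LDC or NIST:
     the tag pattern is deliberately simplistic, the function names
     are hypothetical, and offsets are assumed to be zero-based with
     an inclusive END, which is what the charseq example later in this
     section suggests.  The .sgm content must be decoded from UTF-8
     first so that each character, not each byte, counts as one.

```python
import re

TAG = re.compile(r"<[^>]*>")  # simplistic SGML tag pattern (assumption)


def apf_text(sgm: str) -> str:
    """Return the text that APF offsets index into: tags removed,
    newlines and SGML entities such as "&amp;" left untouched."""
    return TAG.sub("", sgm)


def charseq(sgm: str, start: int, end: int) -> str:
    """Return the characters from start to end, treating END as inclusive
    and offsets as zero-based (both are assumptions)."""
    return apf_text(sgm)[start:end + 1]


# Invented sample document, for illustration only.
sample = "<DOC>\n<TEXT>\nAT&amp;T said\n</TEXT>\n</DOC>\n"
print(repr(apf_text(sample)))
print(charseq(sample, 2, 9))
```

     In the sample, the two newlines that follow the removed tags
     occupy offsets 0 and 1, and "&amp;" contributes five characters,
     so the span from 2 to 9 covers "AT&amp;T".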

   - TIMEX2

     The timex2 element represents TIMEX2 time expression annotations.
     Its optional attributes, such as "VAL" and "MOD", represent the
     TIMEX2 normalization values. 
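
     As a rough sketch, the normalization attributes could be read
     with Python's standard XML parser as shown below.  The function
     name is hypothetical, and the sample fragment is invented and
     heavily simplified; real APF files contain more structure, so the
     APF DTD remains the authoritative description.

```python
import xml.etree.ElementTree as ET


def timex2_values(apf_xml: str):
    """Return (VAL, MOD) pairs for every timex2 element; a missing
    optional attribute is reported as None."""
    root = ET.fromstring(apf_xml)
    return [(t.get("VAL"), t.get("MOD")) for t in root.iter("timex2")]


# Invented, heavily simplified fragment, for illustration only.
sample = """<source_file>
  <document DOCID="EXAMPLE">
    <timex2 ID="T1" VAL="2005-02-03"/>
    <timex2 ID="T2" VAL="2005" MOD="START"/>
    <timex2 ID="T3"/>
  </document>
</source_file>"""
print(timex2_values(sample))
```

     Since VAL and MOD are optional, code that consumes these values
     should be prepared for either attribute to be absent.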

   - TYPE, LDCTYPE and LDCATR in entity_mention

     The TYPE attribute of entity_mention stores the official ACE entity
     mention types, and the LDCTYPE and LDCATR attributes store the
     attributes used in the LDC's annotation process.

   - Name in entity_attributes

     The "name" element in entity_attributes stores the heads of
     "NAM"-type mentions as in the previous years.  In response to
     George Doddington's request, we have added the NAME attribute to
     the "name" element.  The NAME attribute stores slightly normalized
     versions of the names where:

     - \n is replaced with a space
     - multiple spaces are reduced to one space
     - " (double quote) is removed

     - Example:

     <entity_attributes>
        <name NAME="United States">
           <charseq START="4242" END="4254">United
     States</charseq>
        </name>
     </entity_attributes>
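
     The three normalization steps can be sketched in Python as
     follows.  The function name is hypothetical, and the order in
     which the steps are applied is an assumption, since this README
     does not specify it.

```python
import re


def normalize_name(head: str) -> str:
    """Apply the three normalization steps listed above to a mention head."""
    text = head.replace("\n", " ")      # newline -> space
    text = re.sub(r" {2,}", " ", text)  # runs of spaces -> one space
    return text.replace('"', "")        # drop double quotes


print(normalize_name("United\nStates"))
```

     Applied to the head in the example above, which spans a line
     break, this yields the NAME value "United States".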

   - Nickname metonymy

     Nickname metonyms are indicated with METONYMY_MENTION="TRUE" in
     entity_mentions.  "NAM"-type entity mentions marked as nickname
     metonymy do not give rise to name elements.

   - Cross-type metonymy

     "Cross-type" metonyms are represented with relations of the type
     METONYMY.  The METONYMY type relations do not have
     relation_mentions.  The METONYMY type relations are automatically
     generated after the annotation process, and are the only kind of
     relation annotations that appear in this corpus.

   - For more details, please refer to the APF V5.1.2 DTD.

10. DTDs

   The following DTDs are in the dtd subdirectory.

     apf.v5.1.2.dtd             - XML DTD for APF files

     ace_source_sgml.v1.0.4.dtd - SGML DTD for .sgm files

     ag-1.1.dtd                 - XML DTD for AG files

11. Copyright Information

   Portions (c) 1994-1998, 2000, 2003 Xinhua News Agency, (c) 2000,
   2003 Agence France-Presse, (c) 2003 Associated Press Newswire, (c)
   2000, Al-Hayat, (c) 2000, 2002 An-Nahar, (c) 1994-2009 Trustees of
   University of Pennsylvania.

12. Contact Information
  
   If you have questions about this data release, please contact the
   following personnel at the LDC. 

   Zhiyi Song         <zhiyi@ldc.upenn.edu>    - REFLEX-MTE Project Manager
   Stephanie Strassel <strassel@ldc.upenn.edu> - LDC Annotation Group
                                                 Director/REFLEX-MTE
                                                 Consultant
   Kazuaki Maeda <maeda@ldc.upenn.edu>         - Technical Consultant/Manager

   The following former members of LDC also contributed to the
   creation of this corpus.

   Christopher Walker - REFLEX-MTE Project Manager
   Julie Medero       - REFLEX-MTE Lead Developer

--------------------------------------------------------------------------
README Updated for LDC General Publication January 16, 2009 Kazuaki Maeda