ACE Spanish DevTest - 2007 Pilot Evaluation
            
Authors: Christopher Walker, Zhiyi Song, Kazuaki Maeda, Stephanie Strassel

1. Introduction

This file contains documentation on the ACE Spanish DevTest - 2007 Pilot 
Evaluation, Linguistic Data Consortium (LDC) catalog number LDC2015T20 and 
ISBN 1-58563-730-0.

This publication contains the complete set of DevTest data for the ACE 
2007 Spanish Pilot Evaluation, created to support the 2007 Automatic 
Content Extraction (ACE) technology evaluation. The corpus consists of 
newswire data annotated for entities and TIMEX2 expressions and was 
created by the Linguistic Data Consortium with support from the ACE 
Program. This data was previously distributed as an e-corpus (LDC2007E10) 
to participants in the 2007 ACE evaluation. 

The objective of the ACE program is to develop automatic content extraction
technology to support automatic processing of human language in text
form. 

In January and February 2007, sites were evaluated on system performance in five
primary areas: the recognition of entities, values, temporal expressions,
relations, and events.  Entity, relation and event mention detection were
also offered as diagnostic tasks.  All tasks were performed for English and 
Chinese.  Entity and temporal expression recognition were performed for Arabic 
and Spanish.  The current publication comprises the DevTest data for the Spanish
pilot evaluation.

A complete description of the ACE 2007 Evaluation can be found on the ACE
Program web site maintained by the National Institute of Standards and
Technology (NIST): http://www.nist.gov/speech/tests/ace/

2. Annotation

2.1 Tasks and Guidelines

Data contained in this release has been annotated for the following tasks:

    - Entities
    - TIMEX2 extents & normalization

The annotation guidelines used for this corpus were essentially the
same as the ACE 2005 guidelines.  They can be found in the docs/
directory:
  
  Spanish Entity:  Spanish-Entities-Guidelines_v1.6.pdf
  Timex2:          TIMEX2-Guidelines_v0.1.pdf 

2.2 Annotation Process

DevTest data files in this package are annotated for all tasks by two
annotators working independently, like all previously released DevTest
data in other languages.  The first pass annotation is called 1P; the
independent dual first pass annotation is called DUAL.  For both 1P and
DUAL, a single annotator completes all tasks for a file.  Files are
assigned via an automated Annotation Work-flow System (AWS), and file
assignment is double-blind.  

** NOTE: 1P and DUAL are stored as 'fp1' and 'fp2' in this package.

Discrepancies between the 1P and DUAL versions of each file are then
adjudicated by a senior annotator or team leader, resulting in a
high-quality gold standard file.  The gold standard adjudicated file is
known as ADJ.  After adjudication, TIMEX2 values are normalized, a stage
known as NORM.  The complete set of the data in this release has been
NORM annotated.

After NORM, additional corpus-wide quality control (QC) spot-checks are
conducted on normalized data by the team leader and selected senior
annotators.

The full annotation process for Spanish 2007 DevTest is represented below:

1P: entities         DUAL: entities
    TIMEX2 extents         TIMEX2 extents

        |                    |
        |                    |
        |____________________|
                  |
                  |
                  |
                  V
             ADJ: entities
                  TIMEX2 extents
                  |
                  |
                  |
                  V
             NORM: TIMEX2 normalization

3. Source Data Profile

3.1 Data Selection Process

Data selection is semi-automatic.  A document pool is established
for each language based on the genre and epoch requirements.  Humans
then review the pool to select individual documents that are
suitable for ACE annotation, for instance documents that are
representative of their genre and contain targeted ACE entity types.

3.2 Data Sources and Epochs

Below is a description of the data sources and epochs for the
DevTest data set.
 
Spanish
 
    * Newswire (NW): 100%
      sources: Spanish Gigaword Corpus (AFP, APW, Xinhua)
               AFP_SPA - Agence France-Presse, Spanish Service
               APW_SPA - Associated Press Worldstream, Spanish Service
               XIN_SPA - Xinhua, Spanish Service
      training epoch: Jan-Apr 2005
      dev test epoch: May 2005
      test epoch: Jun 2005

4. Annotation Data Profile

Below is information about the amount of data included in the
current release and its annotation status.

    - 1P: data subject to first pass annotation
    - ADJ: data also subject to discrepancy resolution/adjudication
    - NORM: data also subject to TIMEX2 normalization

  - Spanish Word Counts:

        =========words========  ========files========
            1P    ADJ   NORM      1P    ADJ   NORM
    NW   53517  53517  53517     167    167    167
        ----------------------  ---------------------
  Total  53517  53517  53517     167    167    167

5. Data Directory Structure

The data are organized by language, data type and annotation status as
follows:

   - fp1: data subject to first pass (1P) annotation
   - fp2: data subject to independent dual first pass (DUAL) annotation
   - adj: data also subject to adjudication (ADJ)
   - timex2norm: data also subject to TIMEX2 normalization (NORM), plus
                 additional QC.

For a given document, you will find its source .sgm file together
with the .ag.xml, .apf.xml and .tab annotation files in each of the 
four directories "fp1", "fp2", "adj" and "timex2norm".

In other words, for each newswire story, the four annotation 
directories each contain an identical copy of the source
text (.sgm file) along with distinct versions of the associated
annotations (.ag.xml, .apf.xml and .tab files).  Note that in many
cases, two annotation stages have produced identical output for a
given source text, because no changes were made in the later stage. 

The file "FileList.tab", in the docs/ directory, contains information 
about the word counts and annotation status for each file in the 
release.

The following indicates where to find the completed annotation files
and their corresponding source files.

    */timex2norm/*sgm
    */timex2norm/*apf.xml 
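
As an illustration only, the pattern above can be walked programmatically.
The Python sketch below is not part of the release.  The helper name
paired_files is hypothetical, and the sketch assumes that each .apf.xml
file shares its base name with the .sgm file in the same directory.

```python
from pathlib import Path


def paired_files(root, stage="timex2norm"):
    """Yield (sgm_path, apf_path) pairs found under any <stage>/ directory.

    Assumption: the annotation file for DOC.sgm is named DOC.apf.xml and
    sits in the same directory.
    """
    root = Path(root)
    for sgm in sorted(root.glob(f"**/{stage}/*.sgm")):
        base = sgm.name[: -len(".sgm")]
        apf = sgm.with_name(base + ".apf.xml")
        if apf.exists():
            yield sgm, apf
```

Passing stage="adj", "fp1" or "fp2" would list the earlier annotation
stages in the same way.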

6. File Format Description

Each directory contains files of the following formats.  For most
users, the most important files are the .sgm files and .apf.xml
files.

   Source Text (.sgm) Files

      - These files contain the source text data in an SGML format; they
        use UTF-8 encoding and UNIX-style line termination.

   AG (.ag.xml) Files

      - These are annotation files created with the LDC's annotation
        toolkit.  These files have been converted to the corresponding
        .apf.xml files.
        
   ACE Program Format (APF) (.apf.xml) Files

      - These files are in the official ACE annotation file format.  APF 
        is derived from the AG files by means of a routine format 
        conversion process, so that the underlying annotation content of 
        the two files is equivalent.  See section 8 for more details.

   ID table (.tab) Files

      - These files store mapping tables between the IDs used in the
        ag.xml files and their corresponding apf.xml files.

7. Data Validation

Below is a description of the sanity checks and other format
validation steps applied to annotation files created by LDC. 

Checks included in the annotation tool or applied automatically:

    -- Extents stripped of all spaces and punctuation at front and back
    -- GPE mentions without roles were fixed
    -- For non-GPE mentions with roles, roles were removed
    -- All non-complex entity mentions have heads.  For APF, this means
       that all entity mentions have heads
    -- All NAMPRE and NOMPRE GPE mentions have GPE as their role
    -- All files have exactly one timex2 annotation in the DATETIME field
    -- No annotation extents overlap without nesting (entity mention, 
       entity mention head, timex2 mention)
    -- There are no annotations inside of sgm tags
    -- All entities have permissible type-subtype pairs
    -- All files successfully convert to APF
    -- All APF files validate against DTD
    -- All APF files can be scored against themselves
    -- Search for demonstratives tagged as WHQ
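
To make the "no overlap without nesting" check in the list above concrete,
here is a minimal Python sketch.  It is not LDC's validation code.  The
function name crossing_pairs is hypothetical, and spans are assumed to be
(start, end) pairs with inclusive end offsets, which is what the charseq
example in section 8 suggests.

```python
def crossing_pairs(spans):
    """Return pairs of spans that overlap without one containing the other.

    Each span is a (start, end) tuple with an inclusive end offset.
    """
    bad = []
    ordered = sorted(set(spans))
    for i, (s1, e1) in enumerate(ordered):
        for s2, e2 in ordered[i + 1:]:
            if s2 > e1:
                # Sorted by start: no later span can overlap (s1, e1).
                break
            # Here s1 <= s2 <= e1.  The spans cross only if the second one
            # starts strictly inside the first and ends after it.
            if s2 > s1 and e2 > e1:
                bad.append(((s1, e1), (s2, e2)))
    return bad
```

An empty result means every pair of extents is either disjoint or properly
nested.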

Checks applied after annotation as additional QC:

    -- All instances of cross-type metonymy manually reviewed
    -- All instances of co-extensive entity mentions with the same heads
       manually reviewed
    -- Manually examine and correct or describe all fatal errors and warnings
       generated by the most recent version of the scorer
    -- Manual scan of all NOM heads with different entity type/subtype
       values in different parts of the corpus (normalized files only)
    -- Manual scan of all NAM heads with different entity type/subtype
       values in different parts of the corpus (normalized files only)
    -- Manual scan of all entity mention heads by entity type/subtype for
       outliers in normalized files
    -- Checked for multi-word non-NAM heads and corrected them
    -- Checked for other 'multi-word' WHQ pronouns and corrected them 
       (i.e. [l...] as WHQ)
    -- Checked for relative clauses not annotated in their respective
       mention NPs (searched for '... [NOM] [WHQ] ...' and corrected
       by hand)
    -- Checked for and corrected all cases where a determiner is in the 
       scope of a mention's head
    -- Checked all cases where a PER mention was used as a POST modifier
    -- Checked an extensional list of nation names tagged as ORG.GOV
    -- Checked an extensional list of WHQ terms not tagged as WHQ
    -- Looked for parentheses in the head of any mentions
    -- Checked an extensional list of WEA.Exploding terms not tagged as 
       WEA.Exploding
    -- Checked that EU was always tagged as 'GPE.GPE-Cluster'
    -- Manually review timex2 normalization values

8. Notes About APF

   - Offsets

     APF uses the offset counting method traditionally used in previous
     ACE evaluation programs: 

       1) Each (UTF-8) character, not byte, is counted as one.  

       2) Each newline character is counted as one.  (The .sgm files
          use the UNIX-style end of line characters.)

       3) SGML tags are *not* counted towards offsets.  (Please note
          that the AG files included in this release do count SGML tags in
          offsets.)

       4) SGML entities are counted in terms of each character in the
          entities.  For example, "&amp;" is counted as five
          characters, not as one character.
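
The four rules above can be sketched in Python as follows.  This is an
illustration under stated assumptions, not the official offset
implementation: the function names are hypothetical, the tag pattern
assumes no ">" appears inside an attribute value, offsets are assumed to be
zero-based, and END is treated as inclusive (as the charseq example later
in this section suggests).

```python
import re

# Simplified SGML tag matcher (assumption: no '>' inside attribute values).
TAG = re.compile(r"<[^>]+>")


def read_sgm(path):
    """Read a .sgm file as UTF-8 without translating line endings."""
    with open(path, encoding="utf-8", newline="") as handle:
        return handle.read()


def apf_text(sgm_text):
    """Return the character sequence that APF offsets index into.

    Tags are dropped (rule 3).  Newlines stay (rule 2), and entities such
    as '&amp;' are left unexpanded, so each of their characters counts
    (rule 4).  Python strings index by character, not byte (rule 1).
    """
    return TAG.sub("", sgm_text)


def apf_slice(sgm_text, start, end):
    """Return the text covered by START..END, with END inclusive."""
    return apf_text(sgm_text)[start:end + 1]
```

Comparing apf_slice() against the text stored inside a charseq element is
a quick way to test whether these assumptions hold for a given file.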

   - TIMEX2

     The timex2 element represents TIMEX2 time expression annotations.
     Its optional attributes, such as "VAL" and "MOD", represent the
     TIMEX2 normalization values. 
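
A hedged Python sketch of reading these attributes with the standard
library follows.  The element name timex2 and the attributes VAL and MOD
come from the description above.  The function name timex2_values is
hypothetical, and the sketch assumes that the expression text sits in a
charseq element nested somewhere below each timex2 element; consult the
APF DTD for the exact structure.

```python
import xml.etree.ElementTree as ET


def timex2_values(apf_xml):
    """Return (VAL, MOD, text) for every timex2 element in an APF string.

    VAL and MOD are optional, so either may come back as None.
    """
    root = ET.fromstring(apf_xml)
    rows = []
    for timex in root.iter("timex2"):
        # Assumption: the first charseq below timex2 holds the extent text.
        seq = timex.find(".//charseq")
        text = seq.text if seq is not None else None
        rows.append((timex.get("VAL"), timex.get("MOD"), text))
    return rows
```

For a file on disk, ET.parse(path).getroot() can replace the
ET.fromstring() call.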

   - TYPE, LDCTYPE and LDCATR in entity_mention

     The TYPE attribute in entity_mention stores the official ACE entity
     mention type, and the LDCTYPE and LDCATR attributes store the
     attributes used in the LDC's annotation process.

   - Name in entity_attributes

     The "name" element in entity_attributes stores the heads of
     "NAM"-type mentions, as in previous years.  In response to
     George Doddington's request, we have added the NAME attribute to
     the "name" element.  The NAME attribute stores a slightly normalized
     version of the name, in which:

     - \n is replaced with a space
     - multiple spaces are reduced to one space
     - " (double quote) is removed

     - Example:

     <entity_attributes>
        <name NAME="United States">
           <charseq START="4242" END="4254">United
     States</charseq>
        </name>
     </entity_attributes>
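
The three normalization rules can be sketched in Python as below.  This
is an illustration rather than the conversion code used to build the
corpus.  The function name normalize_name is hypothetical, and applying
the rules in the order listed is an assumption, since the order is not
specified.

```python
import re


def normalize_name(head):
    """Apply the three NAME normalization rules in the order listed."""
    head = head.replace("\n", " ")      # newline becomes a space
    head = re.sub(r" {2,}", " ", head)  # runs of spaces become one space
    head = head.replace('"', "")        # double quotes are removed
    return head
```

With the example above, normalize_name("United\nStates") returns
"United States", which matches the NAME attribute shown.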

   - Nickname metonymy

     Nickname metonyms are indicated with METONYMY_MENTION="TRUE" in
     entity_mentions.  "NAM"-type entity mentions marked as nickname
     metonymy do not give rise to name elements.

   - Cross-type metonymy

     "Cross-type" metonyms are represented with relations of the type
     METONYMY.  The METONYMY type relations do not have
     relation_mentions.  The METONYMY type relations are automatically
     generated after the annotation process, and are the only kind of
     relation annotations that appear in this corpus.

   - For more details, please refer to the APF V5.1.2 DTD.

9. DTDs

The following DTDs are in the dtd subdirectory.

     ace_source_sgml.v1.0.4.dtd - SGML DTD for .sgm files
     apf.v5.1.2.dtd             - XML DTD for APF files
     ag-1.1.dtd                 - XML DTD for AG files

10. Copyright Information

(c) 2005 Associated Press Newswire, (c) 2005 Xinhua News Agency, (c)
2006-2007 Trustees of University of Pennsylvania.


----
README Created January 14, 2007 Christopher R. Walker
       Updated January 16, 2007 Stephanie Strassel
       Updated January 25, 2007 Christopher R. Walker
       Updated April 14, 2014 Zhiyi Song
       Updated April 18, 2014 Zhiyi Song
       Updated April 18, 2014 David Graff