ACE Spanish DevTest - 2007 Pilot Evaluation

Authors: Christopher Walker, Zhiyi Song, Kazuaki Maeda, Stephanie Strassel

1. Introduction

This file contains documentation on the ACE Spanish DevTest - 2007
Pilot Evaluation, Linguistic Data Consortium (LDC) catalog number
LDC2015T20 and ISBN 1-58563-730-0.

This publication contains the complete set of DevTest data for the
ACE 2007 Spanish Pilot Evaluation, released to support the 2007
Automatic Content Extraction (ACE) technology evaluation. The corpus
consists of newswire data annotated for entities and TIMEX2
expressions and was created by the Linguistic Data Consortium with
support from the ACE Program. This data was previously distributed as
an e-corpus (LDC2007E10) to participants in the 2007 ACE evaluation.

The objective of the ACE program is to develop automatic content
extraction technology to support automatic processing of human
language in text form. In January and February 2007, sites were
evaluated on system performance in five primary areas: the recognition
of entities, values, temporal expressions, relations, and events.
Entity, relation and event mention detection were also offered as
diagnostic tasks. All tasks were performed for English and Chinese.
Entity and temporal expression recognition were performed for Arabic
and Spanish. The current publication comprises the DevTest data for
the Spanish pilot evaluation.

A complete description of the ACE 2007 Evaluation can be found on the
ACE Program web site maintained by the National Institute of Standards
and Technology (NIST):

  http://www.nist.gov/speech/tests/ace/

2. Annotation

2.1 Tasks and Guidelines

Data contained in this release has been annotated for the following
tasks:

  - Entities
  - TIMEX2 extents & normalization

The annotation guidelines used for this corpus were essentially the
same as the ACE 2005 guidelines.
They can be found under /docs:

  Spanish Entity: Spanish-Entities-Guidelines_v1.6.pdf
  Timex2:         TIMEX2-Guidelines_v0.1.pdf

2.2 Annotation Process

As with all previously released DevTest data in other languages, the
DevTest data files in this package are annotated for all tasks by two
annotators working independently. The first pass annotation is called
1P; the independent dual first pass annotation is called DUAL. For
both 1P and DUAL, a single annotator completes all tasks for a file.
Files are assigned via an automated Annotation Work-flow System (AWS),
and file assignment is double-blind.

** NOTE: 1P and DUAL are stored as 'fp1' and 'fp2' in this package.

Discrepancies between the 1P and DUAL versions of each file are then
adjudicated by a senior annotator or team leader, resulting in a
high-quality gold standard file. The gold standard adjudicated file is
known as ADJ.

After adjudication, TIMEX2 values are normalized; this stage is known
as NORM. The complete set of the data in this release has been NORM
annotated.

After NORM, additional corpus-wide quality control (QC) spot-checks
are conducted on normalized data by the team leader and selected
senior annotators.

The full annotation process for Spanish 2007 DevTest is represented
below:

  1P: entities            DUAL: entities
      TIMEX2 extents            TIMEX2 extents
          |                         |
          |                         |
          |_________________________|
                       |
                       |
                       |
                       V
             ADJ: entities
                  TIMEX2 extents
                       |
                       |
                       |
                       V
             NORM: TIMEX2 normalization

3. Source Data Profile

3.1 Data Selection Process

Data selection is semi-automatic. A document pool is established for
each language based on the genre and epoch requirements. Human
reviewers then select individual documents from the pool that are
suitable for ACE annotation, for instance documents that are
representative of their genre and contain targeted ACE entity types.

3.2 Training Data Sources and Epochs

Below is a description of the data sources and epochs for the DevTest
data set.
Spanish

* Newswire (NW): 100%

  sources: Spanish Gigaword Corpus (AFP, APW, Xinhua)

    AFP_SPA - Agence France-Presse, Spanish Service
    APW_SPA - Associated Press Worldstream, Spanish Service
    XIN_SPA - Xinhua, Spanish Service

  training epoch: Jan-Apr 2005
  dev test epoch: May 2005
  test epoch:     Jun 2005

4. Annotation Data Profile

Below is information about the amount of data included in the current
release and its annotation status.

  - 1P:   data subject to first pass annotation
  - ADJ:  data also subject to discrepancy resolution/adjudication
  - NORM: data also subject to TIMEX2 normalization

  - Spanish Word Counts:

            =========words========    ========files========
              1P      ADJ     NORM      1P     ADJ    NORM
    NW      53517    53517   53517     167     167    167
            ----------------------    ---------------------
    Total   53517    53517   53517     167     167    167

5. Data Directory Structure

The data are organized by language, data type and annotation status as
follows:

  - fp1:        data subject to first pass (complete) annotation
  - fp2:        data subject to independent dual first pass (complete)
                annotation
  - adj:        data subject to adjudication (complete) annotation
  - timex2norm: data also subject to TIMEX2 normalization, plus
                additional QC

For a given document, you will find its source .sgm file together with
the .ag.xml and .apf.xml annotation files in each of the four
directories "fp1", "fp2", "adj" and "timex2norm". In other words, for
each newswire story, the four annotation directories each contain an
identical copy of the source text (.sgm file) along with distinct
versions of the associated annotations (.ag.xml, .apf.xml and .tab
files). Note that in many cases two annotation stages have produced
identical output for a given source text, because no changes were made
in the later stage.

The file "FileList.tab", in the docs/ directory, contains information
about the word counts and annotation status for each file in the
release.

The following indicates where to find the completed annotation files
and their corresponding source files.
  */timex2norm/*sgm
  */timex2norm/*apf.xml

6. File Format Description

Each directory contains files of the following formats. For most
users, the most important files are the .sgm files and .apf.xml files.

Source Text (.sgm) Files

  - These files contain the source text data in an SGML format; they
    use UTF-8 encoding and UNIX-style line termination.

AG (.ag.xml) Files

  - These are annotation files created with the LDC's annotation
    toolkit. These files have been converted to the corresponding
    .apf.xml files.

ACE Program Format (APF) (.apf.xml) Files

  - These files are in the official ACE annotation file format. The
    APF files are derived from the AG files by means of a routine
    format conversion process, so that the underlying annotation
    content of the two files is equivalent. See section 8 for more
    details.

ID table (.tab) Files

  - These files store mapping tables between the IDs used in the
    .ag.xml files and their corresponding .apf.xml files.

7. Data Validation

Below is a description of the sanity checks and other format
validation steps applied to annotation files created by LDC.

Checks included in the annotation tool or applied automatically:

-- Extents stripped of all spaces and punctuation at front and back
-- GPE mentions without roles were fixed
-- For non-GPE mentions with roles, roles were removed
-- All non-complex entity mentions have heads.
   For APF, this means that all entity mentions have heads
-- All NAMPRE and NOMPRE GPE mentions have GPE as their role
-- All files have exactly one timex2 annotation in the DATETIME field
-- No annotation extents overlap without nesting (entity mention,
   entity mention head, timex2 mention)
-- There are no annotations inside of sgm tags
-- All entities have permissible type-subtype pairs
-- All files successfully convert to APF
-- All APF files validate against DTD
-- All APF files can be scored against themselves
-- Search for demonstratives tagged as WHQ

Checks applied after annotation as additional QC:

-- All instances of cross-type metonymy manually reviewed
-- All instances of co-extensive entity mentions with the same heads
   manually reviewed
-- Manually examine and correct or describe all fatal errors and
   warnings generated by the most recent version of the scorer
-- Manual scan of all NOM heads with different entity type/subtype
   values in different parts of the corpus (normalized files only)
-- Manual scan of all NAM heads with different entity type/subtype
   values in different parts of the corpus (normalized files only)
-- Manual scan of all entity mention heads by entity type/subtype for
   outliers in normalized files
-- Checked for multiple-word non-NAM heads and corrected them
-- Checked for other 'multi-word' WHQ pronouns and corrected them
   (i.e. [l...] as WHQ)
-- Checked for relative clauses not annotated in their respective
   mention NPs (searched for '... [NOM] [WHQ] ...'
   and corrected by hand)
-- Checked for and corrected all cases where a determiner is in the
   scope of a mention's head
-- Checked all cases where a PER mention was used as a POST modifier
-- Checked an extensional list of Nation names tagged as ORG.GOV
-- Checked an extensional list of WHQ terms not tagged as WHQ
-- Looked for parentheses in the head of any mentions
-- Checked an extensional list of WEA.Exploding terms not tagged as
   WEA.Exploding
-- Checked that EU was always tagged as 'GPE.GPE-Cluster'
-- Manually review timex2 normalization values

8. Notes About APF

- Offsets

  APF uses the offset counting method traditionally used in previous
  ACE evaluation programs:

  1) Each (UTF-8) character, not byte, is counted as one.

  2) Each newline character is counted as one. (The .sgm files use
     the UNIX-style end of line characters.)

  3) SGML tags are *not* counted towards offsets. (Please note that
     the AG files included in this release do count SGML tags in
     offsets.)

  4) SGML entities are counted in terms of each character in the
     entities. For example, "&amp;" is counted as five characters,
     not as one character.

- TIMEX2

  The timex2 element represents TIMEX2 time expression annotations.
  Its optional attributes, such as "VAL" and "MOD", represent the
  TIMEX2 normalization values.

- TYPE, LDCTYPE and LDCATR in entity_mention

  The TYPE attribute in entity_mention stores the official ACE entity
  mention type, and the LDCTYPE and LDCATR attributes store the
  attributes used in the LDC's annotation process.

- Name in entity_attributes

  The "name" element in entity_attributes stores the heads of
  "NAM"-type mentions as in the previous years. In response to George
  Doddington's request, we have added the NAME attribute to the "name"
  element.
  The NAME attribute stores slightly normalized versions of the names
  where:

    - \n is replaced with a space
    - multiple spaces are reduced to one space
    - " (double quote) is removed
    - Example: United States

- Nickname metonymy

  Nickname metonyms are indicated with METONYMY_MENTION="TRUE" in
  entity_mentions. "NAM"-type entity mentions marked as nickname
  metonymy do not give rise to name elements.

- Cross-type metonymy

  "Cross-type" metonyms are represented with relations of the type
  METONYMY. The METONYMY type relations do not have relation_mentions.
  The METONYMY type relations are automatically generated after the
  annotation process, and are the only kind of relation annotations
  that appear in this corpus.

- For more details, please refer to the APF V5.1.2 DTD.

9. DTDs

The following DTDs are in the dtd subdirectory.

  ace_source_sgml.v1.0.4.dtd - SGML DTD for .sgm files
  apf.v5.1.2.dtd             - XML DTD for APF files
  ag-1.1.dtd                 - XML DTD for AG files

10. Copyright Information

(c) 2005 Associated Press Newswire,
(c) 2005 Xinhua News Agency,
(c) 2006-2007 Trustees of University of Pennsylvania.

----
README Created January 14, 2007  Christopher R. Walker
       Updated January 16, 2007  Stephanie Strassel
       Updated January 25, 2007  Christopher R. Walker
       Updated April 14, 2014    Zhiyi Song
       Updated April 18, 2014    Zhiyi Song
       Updated April 18, 2014    David Graff
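
----
As an illustration of the offset rules and the NAME normalization
described in section 8, the following Python sketch shows one way a
reader might recover the character sequence that APF offsets refer to
and reproduce a normalized NAME value. It is not LDC's conversion
code. The function names (apf_text, extent, normalize_name), the
tag-matching regular expression, the treatment of END offsets as
inclusive, and the order in which the normalization steps are applied
are all assumptions that should be checked against the data and the
APF DTD.

```python
import re

# Assumption: an SGML tag is anything between '<' and the next '>'.
TAG_RE = re.compile(r"<[^>]*>")


def apf_text(sgm: str) -> str:
    """Return the text that APF offsets are assumed to index into.

    Tags are removed (rule 3), newlines are kept (rule 2), and SGML
    entities such as &amp; are left unexpanded so that every character
    of the entity is counted (rule 4). Python strings are sequences of
    characters, not bytes (rule 1).
    """
    return TAG_RE.sub("", sgm)


def extent(sgm: str, start: int, end: int) -> str:
    """Return the extent for a START/END pair.

    Assumption: END is the offset of the last character (inclusive).
    """
    return apf_text(sgm)[start:end + 1]


def normalize_name(head: str) -> str:
    """Apply the three listed NAME normalizations, in the listed order.

    The order is an assumption; the README only lists the steps.
    """
    head = head.replace("\n", " ")        # \n is replaced with a space
    head = re.sub(r" {2,}", " ", head)    # multiple spaces become one
    return head.replace('"', "")          # double quotes are removed


# Hypothetical source text, not taken from the corpus.
sample = "<DOC>\n<TEXT>\nAT&amp;T dijo\n</TEXT>\n</DOC>\n"
print(repr(apf_text(sample)))
print(extent(sample, 2, 9))
print(normalize_name("United\nStates"))
```

When reading an actual .sgm file, opening it with encoding="utf-8" and
newline="" leaves the UNIX line endings untouched, so each newline
still counts as exactly one character.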