REFLEX Entity Translation Training/DevTest
LDC2009T07
Linguistic Data Consortium

1. Introduction

This corpus constitutes the complete set of training data and development
test data for the 2007 REFLEX Entity Translation evaluation. The total set
of Training/DevTest data amounts to approximately 67.5k words for each of
three languages: English, Chinese and Arabic. The data set is made up of
22.5k words of English data, 22.5k words of Chinese data, and 22.5k words
of Arabic data, with each set translated into the other two languages.

The "timex2norm" directories contain the "official" (i.e. final) versions
of the annotation files (.apf.xml) and the source text files (.sgm). All
files in the "timex2norm" directories have undergone all stages of the
annotation pipeline, including all batch QC processes. Users of this data
release should train or test with these files, not the files under the
"1p" and "2p" directories -- which reflect "incomplete" data at various
stages in the annotation pipeline.

The apf.xml files included in this release use the APF DTD version 5.1.2,
which was posted on the NIST ACE website on 12/18/2006.

    http://www.nist.gov/speech/tests/ace/ace07/doc/index.htm

The only difference between version 5.1.1 and version 5.1.2 is that
NOMPOST and NAMPOST have been added to the LDCTYPE attribute value list of
entity_mention in version 5.1.2. These values are used only for Spanish,
which is not included in this release.

2. Segment Alignment

Segments are sentence-like units that are semi-automatically identified
prior to translation. Translators are instructed to translate one segment
at a time, but always within the context of the entire source document.
This means that in most cases, the entities mentioned within a given
segment for the source document will also be mentioned in that same
segment for the translations of that document.
This is not always the case, due to the occasional translation error, or
to valid alternative syntactic or lexical choices by the translator that
result in imperfect segment-entity alignment.

This package contains two types of information for segment alignment.
Segment Alignment Text (.seg) files contain lists of segments with their
APF start and end character offsets for each language. Additionally,
within the doc directory is the file allEntities.tdf, intended for viewing
by humans. This is a tab-delimited table that lists every entity mentioned
in a given segment for a document, across all three languages. The table
is sorted by Document ID then by segment and includes EntityID, Type,
Subtype, Head, Level and Class for every entity/language. Note that the
entities themselves are not aligned or mapped in this table.

Segment alignment information is not available for all documents. For some
documents, different segmentations were used for each of the two
translations that were completed. In other documents, substantial
formatting or encoding changes were introduced during the translation
process. In the following documents, the two versions of the translation
file were disparate enough that the segments have not been aligned:

  - ALFILFILM_20050203.1756
  - ALFILFILM_20050205.0832
  - APW_ENG_20030424.0698
  - APW_ENG_20030527.0232
  - DIGRESSING_20041101.1921
  - DIGRESSING_20041107.0106
  - DIGRESSING_20041206.0246
  - DIGRESSING_20041213.0439
  - DIGRESSING_20050107.0236
  - DIGRESSING_20050115.0958
  - DIGRESSING_20050121.2236
  - DIGRESSING_20050122.0117
  - DIGRESSING_20050123.1900
  - DIGRESSING_20050201.1820
  - EGYDAYS_20050221.1227
  - FLOPPINGACES_20041113.1528.042
  - MARKETVIEW_20050127.0716
  - chtb_254

3. Annotation

3.1 Tasks and Guidelines

Data contained in this release has been annotated for the following tasks:

  - Entities
  - TIMEX2 extents & normalization

The annotation guidelines used for this corpus were basically the same as
the ACE 2005 guidelines.
The ACE 2005 annotation guidelines for each language can be downloaded
from LDC's ACE website:

    http://projects.ldc.upenn.edu/ace/

The annotation guidelines used for TIMEX2 annotation can be found here:

    http://projects.ldc.upenn.edu/ace/docs/English-TIMEX2-Guidelines_v0.1.pdf

3.2 Annotation Process

Training/DevTest data are annotated for all tasks by one annotator and
then second-pass annotated by a senior annotator or team leader. The first
pass (complete) annotation is called 1P. The second pass (complete)
annotation is called 2P.

For 1P, a single junior annotator completes all tasks (entities and TIMEX2
extents) for a file. For 2P, a more experienced senior annotator reviews
the first-pass annotations and corrects any errors they identify. Then,
TIMEX2 values are normalized by an annotator who was specifically trained
for the task. This task is known as NORM. After NORM, additional
corpus-wide quality control (QC) spot-checks are conducted on normalized
data by the team leader and selected senior annotators.

The full annotation process for REFLEX Entity Translation Training/DevTest
data is represented below:

    1P:    entities
           TIMEX2 extents
              |
              V
    2P:    entities
           TIMEX2 extents
              |
              V
    NORM:  TIMEX2 normalization
              |
              V
    QC:    entities
           TIMEX2 extents
           TIMEX2 normalization

4. Source Data Profile

Below is a description of the data sources and epochs for the
Training/DevTest data set.

- REFLEX Entity Translation Training/DevTest

  * Sources: ACE 04, ACE 05 training pools

    - Newswire (NW):
        AFP_ENG (Agence France-Presse - English) - 2003.03-2003.06
        AFA (Agence France-Presse - Arabic) - 2000.10-2000.12
        ALH (Al Hayat - Arabic) - 2000.10-2000.12
        ANN (An Nahar - Arabic) - 2000.10-2000.12
        APW_ENG (Associated Press - English) - 2003.03-2003.06
        XIN (Xinhua News Agency - Chinese) - 2000.10-2000.12
        XIN_ENG (Xinhua News Agency - English) - 2003.03-2003.06
        ZBN (Zaobao News Agency - Chinese) - 2000.10-2000.12

      Note: Files taken from ACE04 have three-letter source IDs (e.g., AFA).
      Files taken from ACE05 use a newer convention: a three-letter source
      ID, an underscore and a three-letter language ID (e.g., APW_ENG).

    - Weblog (WL): Various sources - 2004.11-2005.02
    - Arabic Treebank (ATB): ANN (An Nahar) - 2002.01-2002.05
    - Chinese Treebank (CTB): Xinhua News Agency - 1994.09-1998.01

  * 3-way translation

    - 22.5 Kw English -> Arabic, Chinese
    - 22.5 Kw Chinese -> Arabic, English
    - 22.5 Kw Arabic -> Chinese, English
    - Untranslatable text: In some cases, a word or phrase was deemed
      "untranslatable" by the professional translation agencies. In cases
      where a word or phrase cannot be translated into a target language,
      an empty "" tag has been included in the translation text file.

  * Total of approximately 67,500 words/language

    The word counts vary by language since translated files often have
    larger or smaller word counts than the files in the source language.
    In particular, English files appear to have larger word counts than
    the corresponding Arabic files.

5. Annotation Data Profile

Below is information about the amount of data included in the current
release and its annotation status.

  - 1P: data subject to first pass (complete) annotation
  - 2P: data subject to second pass (complete) annotation
  - NORM: data also subject to TIMEX2 normalization

Note: Chinese data expressed in terms of characters. We assume a
correspondence of roughly 1.5 characters/word.
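As a rough illustration of the 1.5 characters/word assumption, the Python
sketch below converts a Chinese character count into an approximate word
count. The function and constant names are illustrative and are not part of
the release; the sample value is the Chinese total reported for the
Chinese-source data.

```python
# Correspondence assumed by this README for Chinese text.
CHARS_PER_WORD = 1.5


def approx_words(char_count: int) -> int:
    """Estimate a Chinese word count from a character count."""
    return round(char_count / CHARS_PER_WORD)


# 33735 characters of Chinese-source text is roughly 22.5k words.
print(approx_words(33735))  # 22490
```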
- Arabic-source

                   =========words==========   ==========files======
                       1P       2P     NORM      1P     2P   NORM
    Arabic  ATB      4455     4455     4455       9      9      9
    Chinese ATB      9680     9680     9680       9      9      9
    English ATB      6233     6233     6233       9      9      9
    Arabic  NW       7600     7600     7600      48     48     48
    Chinese NW      16657    16657    16657      48     48     48
    English NW       9954     9954     9954      48     48     48
    Arabic  WL      10638    10638    10638      38     38     38
    Chinese WL      23312    23312    23312      38     38     38
    English WL      14563    14563    14563      38     38     38
                   -------------------------   ---------------------
    Arb Total       22693    22693    22693      95     95     95
    Chn Total       49649    49649    49649      95     95     95
    Eng Total       30750    30750    30750      95     95     95

- Chinese-source

                   =========words==========   ==========files======
                       1P       2P     NORM      1P     2P   NORM
    Arabic  CTB      2505     2505     2505      15     15     15
    Chinese CTB      4782     4782     4782      15     15     15
    English CTB      3244     3244     3244      15     15     15
    Arabic  NW       7465     7465     7465      22     22     22
    Chinese NW      13589    13589    13589      22     22     22
    English NW       8980     8980     8980      22     22     22
    Arabic  WL       8847     8847     8847      19     19     19
    Chinese WL      15364    15364    15364      19     19     19
    English WL      10761    10761    10761      19     19     19
                   -------------------------   ---------------------
    Arb Total       18817    18817    18817      56     56     56
    Chn Total       33735    33735    33735      56     56     56
    Eng Total       22985    22985    22985      56     56     56

- English-source

                   =========words==========   ==========files======
                       1P       2P     NORM      1P     2P   NORM
    Arabic  NW      15106    15106    15106      34     34     34
    Chinese NW      31721    31721    31721      34     34     34
    English NW      17104    17104    17104      34     34     34
    Arabic  WL       8918     8918     8918      29     29     29
    Chinese WL      18953    18953    18953      29     29     29
    English WL      10302    10302    10302      29     29     29
                   -------------------------   ---------------------
    Arb Total       24024    24024    24024      63     63     63
    Chn Total       50674    50674    50674      63     63     63
    Eng Total       27406    27406    27406      63     63     63

6. Data Directory Structure

The data are organized by source language:

    data/arabic-source/
    data/chinese-source/
    data/english-source/

Then by data type:

    atb/
    ctb/
    nw/
    wl/

Then by annotation status:

  - 1p: data subject to first pass (complete) annotation
  - 2p: data subject to second pass (complete) annotation
  - timex2norm: data also subject to TIMEX2 normalization, plus
    additional QC.
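The three-level layout above can be sketched as a small path helper. This
is only an illustration: the function name is hypothetical, the spelling
"arabic-source" is assumed, and the set of data types allowed under each
source language is inferred from the tables in section 5 rather than stated
by the release.

```python
from pathlib import PurePosixPath

# Data types seen under each source language in the section 5 tables
# (an inference, not a guarantee about the directory contents).
DATA_TYPES = {
    "arabic": {"atb", "nw", "wl"},
    "chinese": {"ctb", "nw", "wl"},
    "english": {"nw", "wl"},
}
STATUSES = ("1p", "2p", "timex2norm")


def annotation_dir(source_lang, data_type, status="timex2norm"):
    """Build data/<lang>-source/<type>/<status> as a POSIX-style path."""
    if data_type not in DATA_TYPES.get(source_lang, ()):
        raise ValueError(f"no {data_type!r} data for source {source_lang!r}")
    if status not in STATUSES:
        raise ValueError(f"unknown annotation status {status!r}")
    return PurePosixPath("data") / f"{source_lang}-source" / data_type / status


# Defaults to the final "timex2norm" files recommended for training/testing.
print(annotation_dir("chinese", "nw"))  # data/chinese-source/nw/timex2norm
```

The default status is "timex2norm" because those are the files this README
recommends for training and testing.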
So for instance, if a Chinese source file has been fully annotated, you
will find an .apf.xml annotation file of the source Chinese and the
English and Arabic translations in each of "1p", "2p" and "timex2norm".
Here is an example:

  Chinese source:
    data/chinese-source/nw/timex2norm/XIN20001218.2000.0158.apf.xml
  English translation:
    data/chinese-source/nw/timex2norm/XIN20001218.2000.0158.eng.apf.xml
  Arabic translation:
    data/chinese-source/nw/timex2norm/XIN20001218.2000.0158.arb.apf.xml

The "FileList" files contain information about the word (for English and
Arabic) or character (for Chinese) counts and annotation status for each
file in the release.

The "doc" directory contains segment-aligned entity mentions from each
language.

Entity Mention Table (.tdf) File: doc/allEntities.tdf

This report contains all mentions from all languages for all documents in
a single tab-delimited text file with the following fields:

  - docid
  - segment number
  - English entity id
  - English entity type
  - English entity subtype
  - English mention head
  - English mention type (level)
  - English entity class
  - Chinese entity id
  - Chinese entity type
  - Chinese entity subtype
  - Chinese mention head
  - Chinese mention type (level)
  - Chinese entity class
  - Arabic entity id
  - Arabic entity type
  - Arabic entity subtype
  - Arabic mention head
  - Arabic mention type (level)
  - Arabic entity class

No attempt has been made to align entity mentions between languages. In
other words, the English entity mentions for a segment are listed in the
order that they appear in the English text, the Chinese mentions in the
order they appear in the Chinese text, and the Arabic mentions in the
order they appear in the Arabic text. Entity mentions on the same line of
the .tdf file may or may not correspond to mentions of the same entity
across languages.

7. File Format Description

Each directory contains files of the following formats. For most users,
the most important files are the .sgm files and .apf.xml files.

Source Text (.sgm) Files
  - These files contain the source text files in an SGM format. These
    files use the UNIX-style end of lines. All .sgm files are in UTF-8.

ACE Program Format (APF) (.apf.xml) Files
  - These files are in the official ACE annotation file format. See
    section 9 for more details.

AG (.ag.xml) Files
  - These are annotation files created with the LDC's annotation toolkit.
    These files have been converted to the corresponding .apf.xml files.

ID table (.tab) Files
  - These files store mapping tables between the IDs used in the ag.xml
    files and their corresponding apf.xml files.

8. Data Validation

Below is a description of the sanity checks and other format validation
steps applied to annotation files created by LDC.

Checks included in the annotation tool or applied automatically:

-- Extents stripped of all spaces and punctuation at front and back
-- GPE mentions without roles were fixed
-- For non-GPE mentions with roles, roles were removed
-- All non-complex entity mentions have heads.
   For APF, this means that all entity mentions have heads
-- All NAMPRE and NOMPRE GPE mentions have GPE as their role
-- All files have exactly one timex2 annotation in the DATETIME field
-- No annotation extents overlap without nesting (entity mention, entity
   mention head, timex2 mention)
-- There are no annotations inside of sgm tags
-- All entities have permissible type-subtype pairs
-- All files successfully convert to APF
-- All APF files validate against DTD
-- All APF files can be scored against themselves
-- Search for untagged pronouns (English, Arabic)
-- Search for English Building-Grounds mentions containing "Airport" or
   "Airfield"
-- Search for untagged relative clauses (English)
-- Search for demonstratives tagged as WHQ (Arabic)
-- Scan all unannotated common TIMEX2 and value triggers (English)
-- Check that all POSTDATEs are annotated
-- Check for missing SPEAKER annotations
-- Check for missing POSTER annotations

Checks applied after annotation as additional QC:

-- No English passages are annotated in non-English files
-- All instances of cross-type metonymy manually reviewed
-- All instances of co-extensive entity mentions with the same heads
   manually reviewed
-- Manual scan of all NOM heads with different entity type/subtype values
   in different parts of the corpus (normalized files only)
-- Manual scan of all NAM heads with different entity type/subtype values
   in different parts of the corpus (normalized files only)
-- Manual scan of all entity mention heads by entity type/subtype for
   outliers in normalized files
-- Manually examine and correct or describe all fatal errors and warnings
   generated by the most recent version of the scorer
-- Manually review cases where a SPEAKER or POSTER annotation is not
   coreferenced with an entity mention outside of SPEAKER or POSTER tags
-- Manually review timex2 normalization values

9. Notes About APF

- Offsets

  APF uses the offset counting method traditionally used in previous ACE
  evaluation programs:

  1) Each (UTF-8) character, not byte, is counted as one.
  2) Each newline character is counted as one. (The .sgm files use the
     UNIX-style end of line characters.)
  3) SGML tags are *not* counted towards offsets. (Please note that the
     AG files included in this release do count SGML tags in offsets.)
  4) SGML entities are counted in terms of each character in the
     entities. For example, "&amp;" is counted as five characters, not as
     one character.

- TIMEX2

  The timex2 element represents TIMEX2 time expression annotations. Its
  optional attributes, such as "VAL" and "MOD", represent the TIMEX2
  normalization values.

- TYPE, LDCTYPE and LDCATR in entity_mention

  The TYPE attribute of entity_mention stores the official ACE entity
  mention types, and the LDCTYPE and LDCATR attributes store the
  attributes used in the LDC's annotation process.

- Name in entity_attributes

  The "name" element in entity_attributes stores the heads of "NAM"-type
  mentions, as in previous years. In response to George Doddington's
  request, we have added the NAME attribute to the "name" element. The
  NAME attribute stores slightly normalized versions of the names where:

  - \n is replaced with a space
  - multiple spaces are reduced to one space
  - " (double quote) is removed
  - Example: United States

- Nickname metonymy

  Nickname metonyms are indicated with METONYMY_MENTION="TRUE" in
  entity_mentions. "NAM"-type entity mentions marked as nickname metonymy
  do not give rise to name elements.

- Cross-type metonymy

  "Cross-type" metonyms are represented with relations of the type
  METONYMY. The METONYMY type relations do not have relation_mentions.
  The METONYMY type relations are automatically generated after the
  annotation process, and are the only kind of relation annotations that
  appear in this corpus.

- For more details, please refer to the APF V5.1.2 DTD.

10. DTDs

The following DTDs are in the dtd subdirectory.

  apf.v5.1.2.dtd - XML DTD for APF files
  ace_source_sgml.v1.0.4.dtd - SGML DTD for .sgm files
  ag-1.1.dtd - XML DTD for AG files

11. Copyright Information

Portions (c) 1994-1998, 2000, 2003 Xinhua News Agency, (c) 2000, 2003
Agence France-Presse, (c) 2003 Associated Press Newswire, (c) 2000,
Al-Hayat, (c) 2000, 2002 An-Nahar, (c) 1994-2009 Trustees of University
of Pennsylvania.

12. Contact Information

If you have questions about this data release, please contact the
following personnel at the LDC.

  Zhiyi Song - REFLEX-MTE Project Manager
  Stephanie Strassel - LDC Annotation Group Director/REFLEX-MTE Consultant
  Kazuaki Maeda - Technical Consultant/Manager

The following former members of LDC also contributed to the creation of
this corpus.

  Christopher Walker - REFLEX-MTE Project Manager
  Julie Medero - REFLEX-MTE Lead Developer

--------------------------------------------------------------------------
README Updated for LDC General Publication
January 16, 2009
Kazuaki Maeda
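
Supplementary sketch for the offset rules in section 9 (Notes About APF).
The Python below shows one way those rules might be applied to recover the
text that a pair of APF offsets points at. It is not LDC's tooling: the
function names and the sample document are invented for illustration, the
tag pattern is deliberately simple, and treating the END offset as
inclusive is an assumption that this README does not state.

```python
import re

# Simplistic SGML tag pattern: anything between "<" and the next ">".
TAG_RE = re.compile(r"<[^>]*>")


def apf_text(sgm: str) -> str:
    """Return the character stream that APF offsets index into.

    The input must already be decoded from UTF-8, so each position is a
    character rather than a byte (rule 1). Tags are dropped (rule 3).
    Newlines and the characters of SGML entities are left in place, so
    they are counted one by one (rules 2 and 4).
    """
    return TAG_RE.sub("", sgm)


def extract(sgm: str, start: int, end: int) -> str:
    """Return the span for START/END offsets (END assumed inclusive)."""
    return apf_text(sgm)[start:end + 1]


# Invented sample; real .sgm files should be read with encoding="utf-8".
sample = "<DOC>\n<TEXT>\nAT&amp;T rose.\n</TEXT>\n</DOC>\n"
print(extract(sample, 2, 9))  # AT&amp;T
```

In the sample, the two newlines that follow the stripped tags occupy
offsets 0 and 1, and "&amp;" contributes five positions to the span.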