REFLEX Entity Translation Training/DevTest
LDC2009T07
Linguistic Data Consortium
1. Introduction
This corpus constitutes the complete set of training data and
development test data for the 2007 REFLEX Entity Translation
evaluation. The Training/DevTest data total approximately 67.5k
words for each of three languages: English, Chinese and Arabic.
The data set is made up of 22.5k words of English source data,
22.5k words of Chinese source data, and 22.5k words of Arabic
source data, each translated into the other two languages.
The "timex2norm" directories contain the "official" (i.e. final)
versions of the annotation files (.apf.xml) and the source text
files (.sgm). All files in the "timex2norm" directories have
undergone all stages of the annotation pipeline, including all batch
QC processes. Users of this data release should train or test with
these files, not the files under the "1p" and "2p" directories --
which reflect "incomplete" data at various stages in the annotation
pipeline.
The apf.xml files included in this release use the APF DTD version
5.1.2, which was posted on the NIST ACE website on 12/18/2006.
http://www.nist.gov/speech/tests/ace/ace07/doc/index.htm
The only difference between version 5.1.1 and version 5.1.2 is that
NOMPOST and NAMPOST have been added to the LDCTYPE attribute value
list of entity_mention in version 5.1.2. These values are used only
for Spanish, which is not included in this release.
2. Segment Alignment
Segments are sentence-like units that are semi-automatically
identified prior to translation. Translators are instructed to
translate one segment at a time, but always within the context of
the entire source document. This means that in most cases, the
entities mentioned within a given segment for the source document
will also be mentioned in that same segment for the translations of
that document. This is not always the case, due to the occasional
translation error, or to valid alternative syntactic or lexical
choices by the translator that result in imperfect segment-entity
alignment.
This package contains two types of information for segment
alignment. Segment Alignment Text (.seg) files contain lists of
segments with their APF start and end character offsets for each
language. Additionally, within the /doc directory is the file
allEntities.tab, intended for viewing by humans. This is a
tab-delimited table that lists every entity mentioned in a given
segment for a document, across all three languages. The table is
sorted by Document ID then by segment and includes EntityID, Type,
Subtype, Head, Level and Class for every entity/language. Note that
the entities themselves are not aligned or mapped in this table.
Segment alignment information is not available for all documents.
For some documents, different segmentations were used for each of
the two translations that were completed. In other documents,
substantial formatting or encoding changes were introduced during
the translation process. In the following documents, the two
versions of the translation file were disparate enough that the
segments have not been aligned:
- ALFILFILM_20050203.1756
- ALFILFILM_20050205.0832
- APW_ENG_20030424.0698
- APW_ENG_20030527.0232
- DIGRESSING_20041101.1921
- DIGRESSING_20041107.0106
- DIGRESSING_20041206.0246
- DIGRESSING_20041213.0439
- DIGRESSING_20050107.0236
- DIGRESSING_20050115.0958
- DIGRESSING_20050121.2236
- DIGRESSING_20050122.0117
- DIGRESSING_20050123.1900
- DIGRESSING_20050201.1820
- EGYDAYS_20050221.1227
- FLOPPINGACES_20041113.1528.042
- MARKETVIEW_20050127.0716
- chtb_254
3. Annotation
3.1 Tasks and Guidelines
Data contained in this release has been annotated for the following tasks:
- Entities
- TIMEX2 extents & normalization
The annotation guidelines used for this corpus were essentially the
same as the ACE 2005 guidelines.
The ACE 2005 annotation guidelines for each language can be
downloaded from LDC's ACE website:
http://projects.ldc.upenn.edu/ace/
The annotation guidelines used for TIMEX2 annotation can be found
here:
http://projects.ldc.upenn.edu/ace/docs/English-TIMEX2-Guidelines_v0.1.pdf
3.2 Annotation Process
Training/DevTest data are annotated for all tasks by one annotator
and then second-pass annotated by a senior annotator or team
leader. The first pass (complete) annotation is called 1P. The
second pass (complete) annotation is called 2P. For 1P, a single
junior annotator completes all tasks (entities and TIMEX2 extents)
for a file. For 2P, a more experienced senior annotator reviews the
first-pass annotations and corrects any errors they identify. Then,
TIMEX2 values are normalized by an annotator who was specifically
trained for the task. This task is known as NORM. After NORM,
additional corpus-wide quality control (QC) spot-checks are
conducted on normalized data by the team leader and selected senior
annotators.
The full annotation process for REFLEX Entity Translation
Training/DevTest data is represented below:
1P: entities
TIMEX2 extents
|
V
2P: entities
TIMEX2 extents
|
V
NORM: TIMEX2 normalization
|
V
QC: entities
TIMEX2 extents
TIMEX2 normalization
4. Source Data Profile
Below is a description of the data sources and epochs for the
Training/DevTest data set.
- REFLEX Entity Translation Training/DevTest
* Sources: ACE 04, ACE 05 training pools
- Newswire (NW):
AFP_ENG (Agence France-Presse - English) - 2003.03-2003.06
AFA (Agence France-Presse - Arabic) - 2000.10-2000.12
ALH (Al Hayat - Arabic) - 2000.10-2000.12
ANN (An Nahar - Arabic) - 2000.10-2000.12
APW_ENG (Associated Press - English) - 2003.03-2003.06
XIN (Xinhua News Agency - Chinese) - 2000.10-2000.12
XIN_ENG (Xinhua News Agency - English) - 2003.03-2003.06
ZBN (Zaobao News Agency - Chinese) - 2000.10-2000.12
Note: Files taken from ACE04 have three-letter source IDs
(e.g., AFA). Files taken from ACE05 use a newer
convention: a three-letter source ID, an underscore
and a three-letter language ID (e.g., APW_ENG).
- Weblog (WL):
Various sources - 2004.11-2005.02
- Arabic Treebank (ATB):
ANN (An Nahar) - 2002.01-2002.05
- Chinese Treebank (CTB):
Xinhua News Agency - 1994.09-1998.01
* 3-way translation
- 22.5 Kw English -> Arabic, Chinese
- 22.5 Kw Chinese -> Arabic, English
- 22.5 Kw Arabic -> Chinese, English
- Untranslatable text: In some cases, a word or phrase was
deemed "untranslatable" by the professional translation
agencies. In cases where a word or phrase cannot be
translated into a target language, an empty
"" tag has been included in the
translation text file.
* Total of approximately 67,500 words/language
The word counts vary by language since translated files often
have larger or smaller word counts than the files in the source
language. In particular, English files appear to have larger
word counts than the corresponding Arabic files.
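The source ID note in the Newswire list above describes two naming conventions. The following Python sketch tells them apart by the shape of a document ID. The regular expressions, the function name and the returned labels are illustrative assumptions, not part of the release; IDs that fit neither pattern (for example weblog or treebank IDs) are reported as "other".

```python
import re

# Assumed shapes, inferred only from the examples in this README:
#   source + language style:  APW_ENG_20030424.0698
#   three-letter style:       XIN20001218.2000.0158
SOURCE_LANGUAGE_STYLE = re.compile(r"^[A-Z]{3}_[A-Z]{3}_")
THREE_LETTER_STYLE = re.compile(r"^[A-Z]{3}\d")


def source_id_style(docid):
    """Classify a document ID by its source ID convention (heuristic)."""
    if SOURCE_LANGUAGE_STYLE.match(docid):
        return "source_language"
    if THREE_LETTER_STYLE.match(docid):
        return "three_letter"
    return "other"


if __name__ == "__main__":
    for docid in ("APW_ENG_20030424.0698", "XIN20001218.2000.0158", "chtb_254"):
        print(docid, source_id_style(docid))
```

This is only a pattern check on the ID string; it does not confirm which ACE pool a file actually came from.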
5. Annotation Data Profile
Below is information about the amount of data included in the
current release and its annotation status.
- 1P: data subject to first pass (complete) annotation
- 2P: data subject to second pass (complete) annotation
- NORM: data also subject to TIMEX2 normalization
Note: Chinese data are expressed in terms of characters. We assume
a correspondence of roughly 1.5 characters/word.
- Arabic-source
=========words========== ==========files======
1P 2P NORM 1P 2P NORM
Arabic ATB 4455 4455 4455 9 9 9
Chinese ATB 9680 9680 9680 9 9 9
English ATB 6233 6233 6233 9 9 9
Arabic NW 7600 7600 7600 48 48 48
Chinese NW 16657 16657 16657 48 48 48
English NW 9954 9954 9954 48 48 48
Arabic WL 10638 10638 10638 38 38 38
Chinese WL 23312 23312 23312 38 38 38
English WL 14563 14563 14563 38 38 38
------------------------- ---------------------
Arb Total 22693 22693 22693 95 95 95
Chn Total 49649 49649 49649 95 95 95
Eng Total 30750 30750 30750 95 95 95
- Chinese-source
=========words========== ==========files======
1P 2P NORM 1P 2P NORM
Arabic CTB 2505 2505 2505 15 15 15
Chinese CTB 4782 4782 4782 15 15 15
English CTB 3244 3244 3244 15 15 15
Arabic NW 7465 7465 7465 22 22 22
Chinese NW 13589 13589 13589 22 22 22
English NW 8980 8980 8980 22 22 22
Arabic WL 8847 8847 8847 19 19 19
Chinese WL 15364 15364 15364 19 19 19
English WL 10761 10761 10761 19 19 19
------------------------- ---------------------
Arb Total 18817 18817 18817 56 56 56
Chn Total 33735 33735 33735 56 56 56
Eng Total 22985 22985 22985 56 56 56
- English-source
=========words========== ==========files======
1P 2P NORM 1P 2P NORM
Arabic NW 15106 15106 15106 34 34 34
Chinese NW 31721 31721 31721 34 34 34
English NW 17104 17104 17104 34 34 34
Arabic WL 8918 8918 8918 29 29 29
Chinese WL 18953 18953 18953 29 29 29
English WL 10302 10302 10302 29 29 29
------------------------- ---------------------
Arb Total 24024 24024 24024 63 63 63
Chn Total 50674 50674 50674 63 63 63
Eng Total 27406 27406 27406 63 63 63
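Because the Chinese figures above are character counts, they are not directly comparable with the English and Arabic word counts. The Python sketch below applies the rough 1.5 characters/word correspondence stated in the note at the top of this section to the three Chinese totals. The function and variable names are illustrative, and the results are estimates only.

```python
CHARS_PER_WORD = 1.5  # rough correspondence assumed in this README


def estimated_chinese_words(characters):
    """Convert a Chinese character count to an approximate word count."""
    return round(characters / CHARS_PER_WORD)


# Chinese totals (in characters) from the three tables above.
chinese_totals = {
    "arabic-source": 49649,
    "chinese-source": 33735,
    "english-source": 50674,
}

if __name__ == "__main__":
    for source, characters in chinese_totals.items():
        print(source, estimated_chinese_words(characters))
```

Under this assumption the Chinese-source total of 33735 characters comes to about 22.5k words, which is in line with the per-language source size given in section 4.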
6. Data Directory Structure
The data are organized by source language:
data/arabic-source/
data/chinese-source/
data/english-source/
Then by data type:
atb/
ctb/
nw/
wl/
Then by annotation status:
- 1p: data subject to first pass (complete) annotation
- 2p: data subject to second pass (complete) annotation
- timex2norm: data also subject to TIMEX2 normalization, plus
additional QC.
So for instance, if a Chinese source file has been fully annotated,
you will find .apf.xml annotation files for the Chinese source and
for its English and Arabic translations in each of "1p", "2p" and
"timex2norm". Here is an example:
Chinese source:
data/chinese-source/nw/timex2norm/XIN20001218.2000.0158.apf.xml
English translation:
data/chinese-source/nw/timex2norm/XIN20001218.2000.0158.eng.apf.xml
Arabic translation:
data/chinese-source/nw/timex2norm/XIN20001218.2000.0158.arb.apf.xml
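Since only the "timex2norm" files should be used for training or testing, it can help to collect them without touching "1p" or "2p". The following Python sketch does this with a glob over the layout described above. The function name and the "root" argument (the directory that contains "data/") are assumptions made for illustration.

```python
from pathlib import Path


def official_apf_files(root, source_language=None):
    """Return the sorted .apf.xml files found under timex2norm directories.

    root            -- directory containing "data/" (assumed layout)
    source_language -- e.g. "chinese"; None means all source languages
    """
    language_dir = f"{source_language}-source" if source_language else "*-source"
    # data/<language>-source/<data type>/timex2norm/<file>.apf.xml
    pattern = f"data/{language_dir}/*/timex2norm/*.apf.xml"
    return sorted(Path(root).glob(pattern))


if __name__ == "__main__":
    for path in official_apf_files(".", "chinese"):
        print(path)
```

The pattern matches both the source annotation file and the translated ones, since all of them end in ".apf.xml".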
The "FileList" files contain information about the word (for English
and Arabic) or character (for Chinese) counts and annotation status
for each file in the release.
The "doc" directory contains segment-aligned entity mentions from
each language.
Entity Mention Table (.tdf) File: doc/allEntities.tdf
This report contains all mentions from all languages for all
documents in a single tab-delimited text file with the
following fields:
- docid
- segment number
- English entity id
- English entity type
- English entity subtype
- English mention head
- English mention type (level)
- English entity class
- Chinese entity id
- Chinese entity type
- Chinese entity subtype
- Chinese mention head
- Chinese mention type (level)
- Chinese entity class
- Arabic entity id
- Arabic entity type
- Arabic entity subtype
- Arabic mention head
- Arabic mention type (level)
- Arabic entity class
No attempt has been made to align entity mentions between
languages. In other words, the English entity mentions for a
segment are listed in the order that they appear in the English
text, the Chinese mentions in the order they appear in the
Chinese text, and the Arabic mentions in the order they appear
in the Arabic text. Entity mentions on the same line of the
.tdf file may or may not correspond to mentions of the same
entity across languages.
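As an illustration of the field layout listed above, the following Python sketch reads the table into dictionaries keyed by field. The key names are invented for this example, and the sketch assumes that the file is UTF-8, has no header row, and leaves a cell empty when a language has no mention on that line; none of these points is confirmed by this README.

```python
import csv

# Hypothetical key names for the 20 fields, in the order listed above.
PER_LANGUAGE = ("entity_id", "entity_type", "entity_subtype",
                "mention_head", "mention_type", "entity_class")
FIELDS = ["docid", "segment"] + [
    f"{language}_{name}"
    for language in ("english", "chinese", "arabic")
    for name in PER_LANGUAGE
]


def read_entity_table(path):
    """Yield one dict per line of the tab-delimited entity mention table."""
    with open(path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Pad short rows so every key is present (assumption).
            row = row + [""] * (len(FIELDS) - len(row))
            yield dict(zip(FIELDS, row))
```

Grouping the rows by "docid" and "segment" gives the per-segment view; as noted above, mentions that share a line are not necessarily mentions of the same entity.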
7. File Format Description
Each directory contains files of the following formats. For most
users, the most important files are the .sgm files and .apf.xml
files.
Source Text (.sgm) Files
- These files contain the source text in SGML format. They use
UNIX-style line endings. All .sgm files are in UTF-8.
ACE Program Format (APF) (.apf.xml) Files
- These files are in the official ACE annotation file format. See
section 9 for more details.
AG (.ag.xml) Files
- These are annotation files created with the LDC's annotation
toolkit. These files have been converted to the corresponding
.apf.xml files.
ID table (.tab) Files
- These files store mapping tables between the IDs used in the
ag.xml files and their corresponding apf.xml files.
8. Data Validation
Below is a description of the sanity checks and other format
validation steps applied to annotation files created by LDC.
Checks included in the annotation tool or applied automatically:
-- Extents stripped of all spaces and punctuation at front and back
-- GPE mentions without roles were fixed
-- For non-GPE mentions with roles, roles were removed
-- All non-complex entity mentions have heads. For APF, this means
that all entity mentions have heads
-- All NAMPRE and NOMPRE GPE mentions have GPE as their role
-- All files have exactly one timex2 annotation in the DATETIME field
-- No annotation extents overlap without nesting (entity mention,
entity mention head, timex2 mention)
-- There are no annotations inside of sgm tags
-- All entities have permissible type-subtype pairs
-- All files successfully convert to APF
-- All APF files validate against DTD
-- All APF files can be scored against themselves
-- Search for untagged pronouns (English, Arabic)
-- Search for English Building-Grounds mentions containing "Airport" or
"Airfield"
-- Search for untagged relative clauses (English)
-- Search for demonstratives tagged as WHQ (Arabic)
-- Scan all unannotated common TIMEX2 and value triggers (English)
-- Check that all POSTDATEs are annotated
-- Check for missing SPEAKER annotations
-- Check for missing POSTER annotations
Checks applied after annotation as additional QC:
-- No English passages are annotated in non-English files
-- All instances of cross-type metonymy manually reviewed
-- All instances of co-extensive entity mentions with the same heads
manually reviewed
-- Manual scan of all NOM heads with different entity type/subtype
values in different parts of the corpus (normalized files only)
-- Manual scan of all NAM heads with different entity type/subtype
values in different parts of the corpus (normalized files only)
-- Manual scan of all entity mention heads by entity type/subtype for
outliers in normalized files
-- Manually examine and correct or describe all fatal errors and warnings
generated by the most recent version of the scorer
-- Manually review cases where a SPEAKER or POSTER annotation is not
coreferenced with an entity mention outside of SPEAKER or
POSTER tags
-- Manually review timex2 normalization values
9. Notes About APF
- Offsets
APF uses the offset counting method traditionally used in previous
ACE evaluation programs:
1) Each (UTF-8) character, not byte, is counted as one.
2) Each newline character is counted as one. (The .sgm files
use the UNIX-style end of line characters.)
3) SGML tags are *not* counted towards offsets. (Please note
that the AG files included in this release do count SGML tags in
offsets.)
4) SGML entities are counted in terms of each character in the
entities. For example, "&amp;" is counted as five
characters, not as one character.
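The four counting rules above can be sketched in Python as follows. This is a minimal illustration, not the tool used by LDC: the tag pattern is simplistic (it assumes no ">" inside attribute values), the function names are invented, and treating the end offset as inclusive is an assumption that should be checked against the charseq elements in the .apf.xml files.

```python
import re

# Simplistic SGML tag matcher (assumption: no ">" inside attribute values).
TAG = re.compile(r"<[^>]*>")


def read_sgm(path):
    """Read a .sgm file as UTF-8 without translating line endings."""
    with open(path, encoding="utf-8", newline="") as handle:
        return handle.read()


def apf_text(sgm_content):
    """Return the character sequence that APF offsets index into.

    Tags are removed (rule 3); newlines are kept (rule 2); entities are
    left undecoded so each of their characters counts (rule 4). Python
    strings are indexed by character, not byte (rule 1).
    """
    return TAG.sub("", sgm_content)


def extract(sgm_content, start, end):
    """Return the text for an offset pair, assuming END is inclusive."""
    return apf_text(sgm_content)[start:end + 1]


if __name__ == "__main__":
    sample = "<DOC>\n<TEXT>\nAT&amp;T rose.\n</TEXT>\n</DOC>\n"
    print(extract(sample, 2, 9))
```

The same offsets will not work against the AG files, which count SGML tags.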
- TIMEX2
The timex2 element represents TIMEX2 time expression annotations.
Its optional attributes, such as "VAL" and "MOD", represent the
TIMEX2 normalization values.
- TYPE, LDCTYPE and LDCATR in entity_mention
The TYPE attribute of entity_mention stores the official ACE entity
mention types, and the LDCTYPE and LDCATR attributes store the
attributes used in the LDC's annotation process.
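The attributes named in the TIMEX2 and entity_mention notes above can be read with a standard XML parser. The Python sketch below uses only the element and attribute names mentioned in this README. The SAMPLE fragment is hypothetical and is not a valid APF file (its "root" wrapper element and attribute values are invented); the APF V5.1.2 DTD remains the reference for the real structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment for illustration only; not a valid APF document.
SAMPLE = """<root>
  <entity_mention TYPE="NAM" LDCTYPE="NAM" LDCATR="FALSE"/>
  <timex2 VAL="2000-12-18" MOD="START"/>
  <timex2/>
</root>"""


def timex2_values(root):
    """Return (VAL, MOD) for each timex2 element; None where absent."""
    return [(el.get("VAL"), el.get("MOD")) for el in root.iter("timex2")]


def mention_types(root):
    """Return (TYPE, LDCTYPE, LDCATR) for each entity_mention element."""
    return [(el.get("TYPE"), el.get("LDCTYPE"), el.get("LDCATR"))
            for el in root.iter("entity_mention")]


if __name__ == "__main__":
    tree_root = ET.fromstring(SAMPLE)
    print(timex2_values(tree_root))
    print(mention_types(tree_root))
```

Because the normalization attributes are optional, callers should expect None for a timex2 element that carries no value.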
- Name in entity_attributes
The "name" element in entity_attributes stores the heads of
"NAM"-type mentions as in the previous years. In response to
George Doddington's request, we have added the NAME attribute to
the "name" element. The NAME attribute stores slightly normalized
versions of the names where:
- \n is replaced with a space
- multiple spaces are reduced to one space
- " (double quote) is removed
- Example: a head that spans a line break, "United\nStates", is
stored with NAME="United States".
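The three normalization rules listed above can be sketched in Python as follows. The function name is illustrative, and applying the rules in the order in which they are listed is an assumption; this README does not say how the steps were sequenced.

```python
import re


def normalize_name(head):
    """Apply the three NAME normalization rules to a mention head."""
    text = head.replace("\n", " ")      # \n is replaced with a space
    text = re.sub(r" {2,}", " ", text)  # multiple spaces become one space
    text = text.replace('"', "")        # double quotes are removed
    return text


if __name__ == "__main__":
    print(normalize_name("United\nStates"))
```

With these rules a head split across a line break, such as "United" followed by "States" on the next line, becomes a single-line name separated by one space.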
- Nickname metonymy
Nickname metonyms are indicated with METONYMY_MENTION="TRUE" in
entity_mentions. "NAM"-type entity mentions marked as nickname
metonymy do not give rise to name elements.
- Cross-type metonymy
"Cross-type" metonyms are represented with relations of the type
METONYMY. The METONYMY type relations do not have
relation_mentions. The METONYMY type relations are automatically
generated after the annotation process, and are the only kind of
relation annotations that appear in this corpus.
- For more details, please refer to the APF V5.1.2 DTD.
10. DTDs
The following DTDs are in the dtd subdirectory.
apf.v5.1.2.dtd - XML DTD for APF files
ace_source_sgml.v1.0.4.dtd - SGML DTD for .sgm files
ag-1.1.dtd - XML DTD for AG files
11. Copyright Information
Portions (c) 1994-1998, 2000, 2003 Xinhua News Agency, (c) 2000,
2003 Agence France-Presse, (c) 2003 Associated Press Newswire, (c)
2000, Al-Hayat, (c) 2000, 2002 An-Nahar, (c) 1994-2009 Trustees of
University of Pennsylvania.
12. Contact Information
If you have questions about this data release, please contact the
following personnel at the LDC.
Zhiyi Song - REFLEX-MTE Project Manager
Stephanie Strassel - LDC Annotation Group
Director/REFLEX-MTE
Consultant
Kazuaki Maeda - Technical Consultant/Manager
The following former members of LDC also contributed to the
creation of this corpus.
Christopher Walker - REFLEX-MTE Project Manager
Julie Medero - REFLEX-MTE Lead Developer
--------------------------------------------------------------------------
README Updated for LDC General Publication January 16, 2009 Kazuaki Maeda