ACE 2005 English SpatialML Annotations Version 2, Linguistic Data Consortium (LDC) catalog number LDC2011T02 and ISBN 1-58563-573-1, was developed by researchers at The MITRE Corporation. It applies SpatialML tags to the English newswire and broadcast training data annotated for entities, relations and events in ACE 2005 Multilingual Training Corpus LDC2006T06. This second version eliminates a number of annotation inconsistencies and errors identified in ACE 2005 English SpatialML Annotations LDC2008T03. In addition, the SpatialML annotation schema has been updated from version 2.0 to version 3.0.1; the revised annotation guidelines are included in this release.
The ACE (Automatic Content Extraction) program focused on developing automatic content extraction technology to support automatic processing of human language in text form, specifically the extraction of entities, values, temporal expressions, relations and events. SpatialML is a mark-up language for representing spatial expressions in natural language documents. It is intended to emulate earlier progress on time expressions such as TIMEX2, TimeML, and the 2005 ACE guidelines.
SpatialML includes syntax for marking up PLACEs mentioned in text and for linking them to data from gazetteers and other databases. LINKs are used to express relations between places, and RLINKs to capture trajectories for relative locations. To the extent possible, SpatialML leverages ISO and other standards with the goal of making the scheme compatible with existing and future corpora. SpatialML goes beyond these schemes, however, in terms of providing a richer markup for natural language that includes semantic features and relationships that allow mapping to existing resources such as gazetteers. Such markup can be useful for disambiguation, integration with mapping services and spatial reasoning.
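To make the roles of PLACE and RLINK concrete, the following is a minimal Python sketch that reads a small in-line fragment and pairs each relative link with the places it connects. The fragment is invented for illustration and is not taken from the corpus. The attribute names used here (id, form, country, source, destination, direction) are assumptions about the schema and should be checked against the annotation guidelines included in the release.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, not from the corpus. Attribute names are assumptions.
SAMPLE = """<SpatialML version="3.0.1">
A <PLACE id="p1" form="NOM">town</PLACE> north of
<PLACE id="p2" form="NAM" country="FR">Lyon</PLACE>.
<RLINK id="r1" source="p2" destination="p1" direction="N"/>
</SpatialML>"""

root = ET.fromstring(SAMPLE)

# Map each PLACE id to the text span it marks up.
places = {p.get("id"): "".join(p.itertext()).strip() for p in root.iter("PLACE")}

# Collect each RLINK as (source id, destination id, direction).
rlinks = [
    (r.get("source"), r.get("destination"), r.get("direction"))
    for r in root.iter("RLINK")
]

# Resolve the ids to text so that each relative link is readable.
resolved = [(places[src], direction, places[dst]) for src, dst, direction in rlinks]
print(resolved)
```

Keeping the places in a dictionary keyed by id makes it straightforward to resolve any link that refers to places by identifier.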
This corpus contains 210065 total words and 17821 unique words. Counts of unique words can be found in doc/ldc_wordcount.csv, which includes all words that are not part of XML markup (i.e., excluding tag names, attribute names and attribute values). Unique words are counted by comparing case-insensitive forms with leading and trailing punctuation stripped off. Words consisting solely of punctuation are discarded.
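The counting rule described above can be approximated with a short Python sketch. This is not the script used to produce doc/ldc_wordcount.csv; the function name count_words, the regular expression used to drop markup, and the use of ASCII punctuation only are all assumptions made for illustration.

```python
import re
import string
from collections import Counter

# Crude markup filter (assumption): removes each tag together with its
# attribute names and values, leaving only the text content.
TAG_RE = re.compile(r"<[^>]*>")

def count_words(xml_text: str) -> Counter:
    """Count words outside XML markup, case-insensitively."""
    text = TAG_RE.sub(" ", xml_text)
    counts = Counter()
    for token in text.split():
        # Strip leading and trailing ASCII punctuation, then lowercase.
        word = token.strip(string.punctuation).lower()
        if word:  # tokens made only of punctuation become empty and are dropped
            counts[word] += 1
    return counts

sample = '<SpatialML version="3.0.1">The <PLACE id="p1">Paris</PLACE> talks; "paris" -- again.</SpatialML>'
counts = count_words(sample)
total_words = sum(counts.values())
unique_words = len(counts)
print(total_words, unique_words)
```

In this sample, "Paris" and "paris" collapse into a single entry and the bare "--" token is discarded, which mirrors the two rules stated in the paragraph.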
The principal change in the annotation schema is that PATH has been generalized to RLINK, for relative link. At the top level, there is now a version attribute on the root SpatialML tag to record which version of SpatialML was used. A number of smaller changes have been made to the annotation specification; these are listed in Section 2 of the updated guidelines.
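A reader processing files from both releases may want to check the schema version before interpreting the links. The sketch below shows one way to read it in Python; it assumes the attribute is literally named version and that the root element is named SpatialML, and the helper name spatialml_version is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional

def spatialml_version(xml_text: str) -> Optional[str]:
    """Return the version attribute of the root SpatialML element, if any."""
    root = ET.fromstring(xml_text)
    if root.tag != "SpatialML":
        raise ValueError(f"unexpected root element: {root.tag}")
    # Files that lack the attribute yield None.
    return root.get("version")

print(spatialml_version('<SpatialML version="3.0.1"></SpatialML>'))
```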
The files are provided in both in-line XML format and AIF format.
The gaz-deref files contain multiple gazetteer references when more than one exists for a single location; these different gazrefs sometimes correspond to slightly different latlongs. The sgm.dtd validated files do not contain document structure tags that would prevent them from being validated with the SpatialML DTD. These files total 22624650 bytes uncompressed.
Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus at LDC2011T02.
Portions © 2003 Agence France Presse, © 2003 The Associated Press, © 2003 Cable News Network, LP, LLLP, © 2007, 2010 The MITRE Corporation, © 2003 New York Times, © 2003 Xinhua News Agency, © 2003, 2005, 2006, 2008, 2011 Trustees of the University of Pennsylvania