ACE 2005 English SpatialML Annotations


Item Name: ACE 2005 English SpatialML Annotations
Authors: Inderjeet Mani, Janet Hitzeman, Justin Richer, David Harris
LDC Catalog No.: LDC2008T03
ISBN: 1-58563-458-1
Release Date: Jan 22, 2008
Data Type: text
Data Source(s): broadcast conversation, broadcast news, newswire
Project(s): ACE
Application(s): automatic content extraction, spatial analysis
Language(s): English
Language ID(s): eng
Distribution: Web Download
Member fee: $0 for 2008 members
Non-member Fee: US $1000.00
Reduced-License Fee: US $500.00
Extra-Copy Fee: N/A
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Inderjeet Mani, et al.
2008
ACE 2005 English SpatialML Annotations
Linguistic Data Consortium, Philadelphia

Introduction

The ACE (Automatic Content Extraction) program focuses on developing automatic content extraction technology to support automatic processing of human language in text form. The kinds of information recognized and extracted from text include entities, values, temporal expressions, relations and events.

SpatialML is a mark-up language for representing spatial expressions in natural language documents. SpatialML's focus is primarily on geography and culturally relevant landmarks, rather than biology, cosmology, geology, or other regions of the spatial language domain. The goal is to allow for potentially better integration of text collections with resources such as databases that provide spatial information about a domain, including gazetteers, physical feature databases and mapping services.

In ACE 2005 English SpatialML Annotations, the authors applied SpatialML tags to the English training data (originally annotated for entities, relations and events) in ACE 2005 Multilingual Training Corpus, LDC2006T06. (NOTE: 2005 ACE training data and evaluation data were distributed as e-corpora (LDC2005E18, LDC2005E23) to participants in the 2005 ACE evaluation. Some of the files in ACE 2005 English SpatialML Annotations may originate from one of those e-corpora, not from LDC2006T06.)

The SpatialML annotation scheme is intended to emulate earlier progress on the annotation of time expressions, as in TIMEX2, TimeML and the 2005 ACE guidelines.

The main SpatialML tag is the PLACE tag. The central goal of SpatialML is to map PLACE information in text to data from gazetteers and other databases to the extent possible. Therefore, semantic attributes such as country abbreviations, country subdivision and dependent area abbreviations (e.g., US states), and geo-coordinates are used to help establish such a mapping. LINK and PATH tags express relations between places, such as inclusion relations and trajectories of various kinds. Information in the tag along with the tagged location string should be sufficient to uniquely determine the mapping, when such a mapping is possible. This also means that redundant information is not included in the tag.
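
As a rough illustration of how PLACE attributes might be read back out of annotated text, the following Python sketch parses a small hand-written fragment. The fragment is not taken from the corpus, and the DOC wrapper and the attribute names used here (id, country, state, latLong) are assumptions made for the example; the guidelines paper in the online documentation is the authority on the actual tag inventory.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment written for this example. The wrapper element and the
# attribute names/values are illustrative assumptions, not corpus data.
SAMPLE = (
    '<DOC>He flew from '
    '<PLACE id="1" country="US" state="MA" latLong="42.36 -71.06">Boston</PLACE>'
    ' to '
    '<PLACE id="2" country="FR" latLong="48.86 2.35">Paris</PLACE>.'
    '</DOC>'
)


def extract_places(xml_text):
    """Return one dict per PLACE element: its attributes plus the tagged string."""
    root = ET.fromstring(xml_text)
    places = []
    for elem in root.iter("PLACE"):
        record = dict(elem.attrib)      # semantic attributes carried by the tag
        record["text"] = elem.text or ""  # the tagged location string
        places.append(record)
    return places


places = extract_places(SAMPLE)
for p in places:
    # Attributes that are absent (e.g. state for Paris) come back as None.
    print(p["id"], p["text"], p.get("country"), p.get("state"))
```

The tag attributes together with the tagged string are what a downstream lookup against a gazetteer would use, which is why the sketch keeps both in one record.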

To the extent possible, SpatialML leverages ISO and other standards towards the goal of making the scheme compatible with existing and future corpora. The SpatialML guidelines are compatible with existing guidelines for spatial annotation and existing corpora within the ACE research program. In particular, the English Annotation Guidelines for Entities (Version 5.6.6 2006.08.01) were exploited, specifically the GPE, Location, and Facility entity tags, and the Physical relation tags, all of which are mapped to SpatialML tags. Ideas were also borrowed from the Toponym Resolution Markup Language of Leidner (2006), the research of Schilder et al. (2004) and the annotation scheme in Garbin and Mani (2005). Information recorded in the annotation is compatible with the feature types in the Alexandria Digital Library. This corpus also leverages the integrated gazetteer database (IGDB) of Mardis and Burger (2005). Last but not least, this annotation scheme can be related to the Geography Markup Language (GML) defined by the Open Geospatial Consortium (OGC), as well as Google Earth's Keyhole Markup Language (KML), to express geographical features.

SpatialML goes beyond these schemes, however, in terms of providing a richer markup for natural language that includes semantic features and relationships that allow mapping to existing resources such as gazetteers. Such a markup can be useful for (i) disambiguation, (ii) integration with mapping services, and (iii) spatial reasoning. In relation to (iii), it is possible to use spatial reasoning not only for integration with applications, but for better information extraction, e.g., for disambiguating a place name based on the locations of other place names in the document. SpatialML goes to some length to represent topological relationships among places, derived from the RCC8 Calculus (Randell et al. 1992, Cohn et al. 1997).
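
The idea of disambiguating a place name from the locations of other place names in the same document can be sketched as a nearest-candidate heuristic. The Python code below is only an illustration under stated assumptions: the candidate list, the coordinates, and the function names are hypothetical, and mean great-circle distance is a stand-in technique chosen for this example, not the method used by the authors.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def disambiguate(candidates, context_coords):
    """Pick the candidate whose mean distance to the context places is smallest."""
    if not context_coords:
        raise ValueError("need at least one context location")

    def mean_distance(candidate):
        total = sum(haversine_km(candidate["coords"], c) for c in context_coords)
        return total / len(context_coords)

    return min(candidates, key=mean_distance)


# Hypothetical gazetteer candidates for the ambiguous string "Paris"
# (approximate coordinates, for illustration only).
candidates = [
    {"label": "Paris, FR", "coords": (48.86, 2.35)},
    {"label": "Paris, TX", "coords": (33.66, -95.56)},
]
# Approximate coordinates of other places mentioned in the same document
# (here: Dallas and Austin).
context = [(32.78, -96.80), (30.27, -97.74)]

best = disambiguate(candidates, context)
print(best["label"])
```

A heuristic like this only uses point coordinates; the topological relations that SpatialML records between places would support richer reasoning than distance alone.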

Additional information about SpatialML is contained in the paper "SpatialML: Annotation Scheme for Marking Spatial Expressions in Natural Language," which is included in the online documentation for this corpus.

Please direct all questions about this corpus to Janet Hitzeman (hitz@mitre.org).

Samples

For an example of the data in the corpus, please examine this sample.

Content Copyright

Portions © 2003 Agence France-Presse, © 2003 The Associated Press, © 2003 Cable News Network, LP, LLLP, © 2007 The MITRE Corporation, © 2003 New York Times, © 2003 Xinhua News Agency, © 2003, 2005, 2006, 2008 Trustees of the University of Pennsylvania