ACE 2005 English SpatialML Annotations

Item Name: ACE 2005 English SpatialML Annotations
Author(s): Inderjeet Mani, Janet Hitzeman, Justin Richer, David Harris
LDC Catalog No.: LDC2008T03
ISBN: 1-58563-458-1
ISLRN: 472-226-418-389-7
Release Date: January 22, 2008
Member Year(s): 2008
DCMI Type(s): Text
Data Source(s): broadcast news, newswire, broadcast conversation
Project(s): ACE
Application(s): spatial analysis, automatic content extraction
Language(s): English
Language ID(s): eng
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2008T03 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Mani, Inderjeet, et al. ACE 2005 English SpatialML Annotations LDC2008T03. Web Download. Philadelphia: Linguistic Data Consortium, 2008.

Introduction

The ACE (Automatic Content Extraction) program focuses on developing automatic content extraction technology to support automatic processing of human language in text form. The kinds of information recognized and extracted from text include entities, values, temporal expressions, relations and events.

SpatialML is a markup language for representing spatial expressions in natural language documents. Its focus is primarily on geography and culturally relevant landmarks, rather than biology, cosmology, geology, or other regions of the spatial language domain. The goal is to allow for potentially better integration of text collections with resources that provide spatial information about a domain, including gazetteers, physical feature databases and mapping services.

In ACE 2005 English SpatialML Annotations, the authors applied SpatialML tags to the English training data (originally annotated for entities, relations and events) in ACE 2005 Multilingual Training Corpus, LDC2006T06. (NOTE: 2005 ACE training data and evaluation data were distributed as e-corpora (LDC2005E18, LDC2005E23) to participants in the 2005 ACE evaluation. Some of the files in ACE 2005 English SpatialML Annotations may originate from one of those e-corpora, not from LDC2006T06.)

The SpatialML annotation scheme is intended to emulate the progress made by earlier schemes for annotating time expressions, such as TIMEX2, TimeML and the 2005 ACE guidelines.

The main SpatialML tag is the PLACE tag. The central goal of SpatialML is to map PLACE information in text to data from gazetteers and other databases to the extent possible. Therefore, semantic attributes such as country abbreviations, country subdivision and dependent area abbreviations (e.g., US states), and geo-coordinates are used to help establish such a mapping. LINK and PATH tags express relations between places, such as inclusion relations and trajectories of various kinds. Information in the tag along with the tagged location string should be sufficient to uniquely determine the mapping, when such a mapping is possible. This also means that redundant information is not included in the tag.
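As a rough illustration of how PLACE and LINK annotations might be consumed, the following Python sketch parses a small hand-written fragment. The fragment is not taken from the corpus, and the attribute names and values in it (id, type, country, state, source, target, linkType) are assumptions made for the example; the SpatialML paper in the online documentation is the authority on the actual schema.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only; it is NOT from the corpus.
# Attribute names and values are assumptions, not the official schema.
SAMPLE = """<DOC>
<PLACE id="p1" type="PPL" country="US" state="MA">Boston</PLACE> is a city in
<PLACE id="p2" type="COUNTRY" country="US">the United States</PLACE>.
<LINK id="l1" source="p1" target="p2" linkType="IN"/>
</DOC>"""


def extract_places(xml_text):
    """Return {place id: {"text": tagged string, "attrs": other attributes}}."""
    root = ET.fromstring(xml_text)
    places = {}
    for el in root.iter("PLACE"):
        attrs = dict(el.attrib)
        place_id = attrs.pop("id")
        places[place_id] = {"text": (el.text or "").strip(), "attrs": attrs}
    return places


def extract_links(xml_text):
    """Return (source id, relation, target id) triples for each LINK element."""
    root = ET.fromstring(xml_text)
    return [
        (el.get("source"), el.get("linkType"), el.get("target"))
        for el in root.iter("LINK")
    ]


if __name__ == "__main__":
    print(extract_places(SAMPLE))
    print(extract_links(SAMPLE))
```

In this sketch the tagged string together with its attributes is what a downstream step would use to look up a gazetteer entry, and each LINK triple relates two PLACE ids.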

To the extent possible, SpatialML leverages ISO and other standards towards the goal of making the scheme compatible with existing and future corpora. The SpatialML guidelines are compatible with existing guidelines for spatial annotation and existing corpora within the ACE research program. In particular, the English Annotation Guidelines for Entities (Version 5.6.6 2006.08.01) were exploited, specifically the GPE, Location, and Facility entity tags, and the Physical relation tags, all of which are mapped to SpatialML tags. Ideas were also borrowed from the Toponym Resolution Markup Language of Leidner (2006), the research of Schilder et al. (2004) and the annotation scheme in Garbin and Mani (2005). Information recorded in the annotation is compatible with the feature types in the Alexandria Digital Library. This corpus also leverages the integrated gazetteer database (IGDB) of Mardis and Burger (2005). Finally, this annotation scheme can be related to the Geography Markup Language (GML) defined by the Open Geospatial Consortium (OGC), as well as Google Earth's Keyhole Markup Language (KML), to express geographical features.

SpatialML goes beyond these schemes, however, in terms of providing a richer markup for natural language that includes semantic features and relationships that allow mapping to existing resources such as gazetteers. Such a markup can be useful for (i) disambiguation, (ii) integration with mapping services, and (iii) spatial reasoning. In relation to (iii), it is possible to use spatial reasoning not only for integration with applications, but for better information extraction, e.g., for disambiguating a place name based on the locations of other place names in the document. SpatialML goes to some length to represent topological relationships among places, derived from the RCC8 Calculus (Randell et al. 1992, Cohn et al. 1997).
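The idea of disambiguating a place name from the locations of other place names in the same document can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used by the corpus authors: the function names are hypothetical, the heuristic (pick the candidate with the smallest mean great-circle distance to already-resolved places) is a simple stand-in, and the coordinates are approximate values supplied for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, adequate for a rough heuristic


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def disambiguate(candidates, context_coords):
    """Pick the candidate closest, on average, to the resolved context places.

    candidates: {label: (lat, lon)} for one ambiguous place name (hypothetical
    gazetteer lookups). context_coords: (lat, lon) pairs for other places in
    the document that have already been resolved.
    """
    if not context_coords:
        raise ValueError("need at least one resolved context place")

    def mean_distance(item):
        _, coord = item
        total = sum(haversine_km(coord, c) for c in context_coords)
        return total / len(context_coords)

    return min(candidates.items(), key=mean_distance)[0]


if __name__ == "__main__":
    # Approximate coordinates, for illustration only.
    paris = {"Paris, France": (48.8566, 2.3522),
             "Paris, Texas": (33.6609, -95.5555)}
    dallas = (32.7767, -96.7970)
    print(disambiguate(paris, [dallas]))
```

A document that also mentions Dallas would, under this heuristic, resolve "Paris" to the Texas candidate. A real system would likely combine distance with other evidence such as LINK relations and feature types.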

Additional information about SpatialML is contained in the paper "SpatialML: Annotation Scheme for Marking Spatial Expressions in Natural Language," which is included in the online documentation for this corpus.

Please direct all questions about this corpus to Janet Hitzeman (hitz@mitre.org).

Samples

For an example of the data in the corpus, please examine this sample.
