ACE 2005 Mandarin SpatialML Annotations


Item Name: ACE 2005 Mandarin SpatialML Annotations
Authors: Xiaoman Wang, Christy Doran, Janet Hitzeman, Inderjeet Mani
LDC Catalog No.: LDC2010T09
ISBN: 1-58563-546-4
Release Date: May 14, 2010
Data Type: text
Data Source(s): broadcast news
Project(s): ACE
Application(s): automatic content extraction, spatial analysis
Language(s): Mandarin Chinese
Language ID(s): cmn
Distribution: Web Download
Member fee: $0 for 2010 members
Non-member Fee: US $500.00
Reduced-License Fee: US $250.00
Extra-Copy Fee: N/A
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Xiaoman Wang, et al.
2010
ACE 2005 Mandarin SpatialML Annotations
Linguistic Data Consortium, Philadelphia

Introduction

ACE 2005 Mandarin SpatialML Annotations was developed by researchers at The MITRE Corporation (MITRE). It applies SpatialML tags to a subset of the source Mandarin training data in the ACE 2005 Multilingual Training Corpus (LDC2006T06). Annotations for entities, relations, and events, which were included in the ACE 2005 Multilingual Training Corpus, are not included in the current SpatialML release. For SpatialML markup of ACE 2005 English data, see ACE 2005 English SpatialML Annotations (LDC2008T03).

SpatialML is a mark-up language for representing spatial expressions in natural language documents. SpatialML focuses on geography and culturally relevant landmarks, rather than biology, cosmology, geology, or other regions of the spatial language domain. The goal is to allow for better integration of text collections with resources such as databases that provide spatial information about a domain, including gazetteers, physical feature databases, and mapping services.

The ACE (Automatic Content Extraction) Program seeks to develop extraction technology to support automatic processing of source language data (in the form of natural text, and as text derived from automatic speech recognition and optical character recognition). This includes classification, filtering, and selection based on the language content of the source data, i.e., based on the meaning conveyed by the data. Thus the ACE program requires the development of technologies that automatically detect and characterize this meaning. The annotation efforts of the ACE program support the development of automatic content extraction technology for processing human language in text form. The kinds of information recognized and extracted from text include entities, values, temporal expressions, relations, and events.

The SpatialML annotation scheme is intended to emulate earlier progress on time expressions such as TIMEX2, TimeML, and the 2005 ACE guidelines. The main SpatialML tag is the PLACE tag which encodes information about location. The central goal of SpatialML is to map location information in text to data from gazetteers and other databases to the extent possible by defining attributes in the PLACE tag. Therefore, semantic attributes such as country abbreviations, country subdivision and dependent area abbreviations (e.g., US states), and geo-coordinates are used to help establish such a mapping. LINK and PATH tags express relations between places, such as inclusion relations and trajectories of various kinds. Information in the tag along with the tagged location string should be sufficient to uniquely determine the mapping, when such a mapping is possible. This also means that redundant information is not included in the tag. To the extent possible, SpatialML leverages ISO and other standards towards the goal of making the scheme compatible with existing and future corpora. The SpatialML guidelines are compatible with existing guidelines for spatial annotation and existing corpora within the ACE research program.
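
The roles of the PLACE and LINK tags described above can be illustrated with a minimal Python sketch. The sample markup is hypothetical and is not taken from the corpus. The attribute names (id, type, country, latLong, source, target, linkType) and their value formats are assumptions chosen to mirror the description above; the SpatialML guidelines included with this release are the authority on the actual tag set.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, NOT taken from the corpus. Attribute names and
# value formats are assumptions made for illustration only.
SAMPLE = """<DOC>
<PLACE id="1" type="PPL" country="CN" latLong="39.9N 116.4E">北京</PLACE>是<PLACE id="2" type="COUNTRY" country="CN">中国</PLACE>的首都。
<LINK id="3" source="1" target="2" linkType="IN"/>
</DOC>"""


def extract_places(xml_text):
    """Return a dict mapping each PLACE id to its text and attributes."""
    root = ET.fromstring(xml_text)
    places = {}
    for el in root.iter("PLACE"):
        record = {"text": "".join(el.itertext())}
        # Copy every attribute except the id, which becomes the key.
        record.update({k: v for k, v in el.attrib.items() if k != "id"})
        places[el.get("id")] = record
    return places


def extract_links(xml_text, places):
    """Return (source text, link type, target text) for each LINK tag."""
    root = ET.fromstring(xml_text)
    return [
        (
            places[link.get("source")]["text"],
            link.get("linkType"),
            places[link.get("target")]["text"],
        )
        for link in root.iter("LINK")
    ]


if __name__ == "__main__":
    places = extract_places(SAMPLE)
    for place_id, record in places.items():
        print(place_id, record)
    for relation in extract_links(SAMPLE, places):
        print(relation)
```

In this sketch the tagged string plus the tag's attributes form the record that would be matched against a gazetteer entry, and the LINK tag refers to places by id rather than repeating their attributes, in keeping with the principle that redundant information is left out of the tag.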

Data

This corpus consists of a 298-document subset of broadcast material from the ACE 2005 Multilingual Training Corpus (LDC2006T06) that has been tagged by a native Mandarin speaker according to version 2.3 of the SpatialML annotation guidelines, which are included in the documentation for this release.

Updates

No updates have been issued at this time.

Content Copyright

Portions © 2000-2001 China Broadcasting System, © 2000-2001 China Central TV, © 2000-2001 China National Radio, © 2000-2001 China Television System, © 2008-2009 The MITRE Corporation, © 2005, 2006, 2010 Trustees of the University of Pennsylvania