ACE 2005 Mandarin SpatialML Annotations
Item Name: | ACE 2005 Mandarin SpatialML Annotations |
Author(s): | Xiaoman Wang, Christine Doran, Janet Hitzeman, Inderjeet Mani |
LDC Catalog No.: | LDC2010T09 |
ISBN: | 1-58563-546-4 |
ISLRN: | 951-452-048-245-8 |
DOI: | https://doi.org/10.35111/pkce-3b81 |
Release Date: | May 14, 2010 |
Member Year(s): | 2010 |
DCMI Type(s): | Text |
Data Source(s): | broadcast news |
Project(s): | ACE |
Application(s): | spatial analysis, automatic content extraction |
Language(s): | Mandarin Chinese |
Language ID(s): | cmn |
License(s): | LDC User Agreement for Non-Members |
Online Documentation: | LDC2010T09 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Wang, Xiaoman, et al. ACE 2005 Mandarin SpatialML Annotations LDC2010T09. Web Download. Philadelphia: Linguistic Data Consortium, 2010. |
Introduction
ACE 2005 Mandarin SpatialML Annotations was developed by researchers at The MITRE Corporation (MITRE). It applies SpatialML tags to a subset of the source Mandarin training data in ACE 2005 Multilingual Training Corpus (LDC2006T06). The annotations for entities, relations, and events that were included in ACE 2005 Multilingual Training Corpus are not included in the current SpatialML release. For SpatialML markup of ACE 2005 English data, see ACE 2005 English SpatialML Annotations (LDC2008T03).
SpatialML is a mark-up language for representing spatial expressions in natural language documents. SpatialML focuses on geography and culturally relevant landmarks, rather than biology, cosmology, geology, or other regions of the spatial language domain. The goal is to allow for better integration of text collections with resources such as databases that provide spatial information about a domain, including gazetteers, physical feature databases, and mapping services.
The ACE (Automatic Content Extraction) Program seeks to develop extraction technology to support automatic processing of source language data (in the form of natural text, and as text derived from automatic speech recognition and optical character recognition). This includes classification, filtering, and selection based on the language content of the source data, i.e., based on the meaning conveyed by the data. Thus the ACE program requires the development of technologies that automatically detect and characterize this meaning. The annotation efforts of the ACE program support the development of automatic content extraction technology for processing human language in text form. The kinds of information recognized and extracted from text include entities, values, temporal expressions, relations, and events.
The SpatialML annotation scheme is intended to emulate earlier progress on time expressions such as TIMEX2, TimeML, and the 2005 ACE guidelines. The main SpatialML tag is the PLACE tag which encodes information about location. The central goal of SpatialML is to map location information in text to data from gazetteers and other databases to the extent possible by defining attributes in the PLACE tag. Therefore, semantic attributes such as country abbreviations, country subdivision and dependent area abbreviations (e.g., US states), and geo-coordinates are used to help establish such a mapping. LINK and PATH tags express relations between places, such as inclusion relations and trajectories of various kinds. Information in the tag along with the tagged location string should be sufficient to uniquely determine the mapping, when such a mapping is possible. This also means that redundant information is not included in the tag. To the extent possible, SpatialML leverages ISO and other standards towards the goal of making the scheme compatible with existing and future corpora. The SpatialML guidelines are compatible with existing guidelines for spatial annotation and existing corpora within the ACE research program.
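The following Python sketch illustrates how PLACE and LINK tags of the kind described above might be read from an annotated document. It is an illustration only: the sample fragment, its attribute names (id, type, country, source, target, linkType), and their values are assumptions made for this example and are not copied from the corpus. The SpatialML annotation guidelines included in the documentation for this release are the authoritative description of the tag set. The helper names extract_places and extract_links are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical SpatialML-style fragment. The attribute names and values are
# illustrative assumptions, not taken from the corpus or the guidelines.
SAMPLE = (
    '<DOC>'
    '<PLACE id="p1" type="PPLC" country="CN">北京</PLACE>位于'
    '<PLACE id="p2" type="COUNTRY" country="CN">中国</PLACE>。'
    '<LINK id="l1" source="p1" target="p2" linkType="IN"/>'
    '</DOC>'
)


def extract_places(xml_text):
    """Map each PLACE id to its tagged string and remaining attributes."""
    root = ET.fromstring(xml_text)
    places = {}
    for element in root.iter("PLACE"):
        attrs = dict(element.attrib)
        place_id = attrs.pop("id")
        places[place_id] = {"text": element.text or "", "attrs": attrs}
    return places


def extract_links(xml_text):
    """Return (source id, relation, target id) triples for each LINK tag."""
    root = ET.fromstring(xml_text)
    return [
        (element.get("source"), element.get("linkType"), element.get("target"))
        for element in root.iter("LINK")
    ]


places = extract_places(SAMPLE)
links = extract_links(SAMPLE)

# Resolve link endpoints to the tagged location strings.
resolved = [
    (places[src]["text"], relation, places[tgt]["text"])
    for src, relation, tgt in links
]
```

In this sketch the tagged string together with the tag attributes is what a downstream system would use to look up a gazetteer entry; the LINK triple records that the first place is contained in the second.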
Data
This corpus consists of a 298-document subset of broadcast material from the ACE 2005 Multilingual Training Corpus (LDC2006T06) that has been tagged by a native Mandarin speaker according to version 2.3 of the SpatialML annotation guidelines, which are included in the documentation for this release.
Updates
No updates have been issued at this time.