Unified Linguistic Annotation Text Collection

Item Name: Unified Linguistic Annotation Text Collection
Author(s): Linguistic Data Consortium
LDC Catalog No.: LDC2009T07
ISBN: 1-58563-511-1
ISLRN: 369-443-379-033-6
DOI: https://doi.org/10.35111/gh95-sk17
Release Date: March 17, 2009
Member Year(s): 2009
DCMI Type(s): Text
Application(s): summarization, sociolinguistics, question-answering, psycholinguistics, pragmatics, information retrieval
Language(s): English, Mandarin Chinese, Standard Arabic, Arabic
Language ID(s): eng, cmn, arb, ara
License(s): LDC User Agreement for Non-Members
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Linguistic Data Consortium. Unified Linguistic Annotation Text Collection LDC2009T07. Web Download. Philadelphia: Linguistic Data Consortium, 2009.

Introduction

Unified Linguistic Annotation Text Collection consists of two separate corpora: The Language Understanding Annotation Corpus (LDC2009T10) and REFLEX Entity Translation Training/DevTest (LDC2009T11).

Most recent annotation efforts for language have focused on small pieces of the larger problem of semantic annotation rather than producing a single unified representation. The Unified Linguistic Annotation (ULA) project, sponsored by the National Science Foundation, seeks to integrate into one framework different layers of annotation (e.g., semantics, discourse, temporal, opinions) using various existing resources, including PropBank, NomBank, TimeBank, Penn Discourse Treebank, and coreference and opinion annotations. The project represents a concerted effort of researchers from several institutions to develop a large word corpus with balanced and annotated data. The ULA Text Collection is provided as a resource for the ULA effort. It consists of two datasets: the Language Understanding Annotation Corpus from the Johns Hopkins Center of Excellence in Human Language Technology, and ACE REFLEX Entity Translation Training/DevTest, developed by LDC.

The Language Understanding Annotation Corpus (LDC2009T10). The Language Understanding Annotation Corpus consists of over 9,000 words of text in English (6,949 words) and Arabic (2,183 words), annotated for committed belief, event and entity coreference, dialog acts and temporal relations. The materials were chosen from various sources to represent "informal input," that is, text that contains colloquial forms. The documents in the corpus include excerpts from newswire stories, telephone conversation transcripts, emails, contracts and written instructions.

REFLEX Entity Translation Training/DevTest (LDC2009T11). REFLEX Entity Translation Training/DevTest is the complete set of training data and development test data for the 2007 REFLEX Entity Translation evaluation sponsored by the National Institute of Standards and Technology (NIST). It contains approximately 67.5k words of newswire and weblog text in total across English, Chinese and Arabic (approximately 22.5k words in each language), with the text in each language translated into each of the other two languages. The data is annotated for entities and for TIMEX2 extents and normalization.
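
The word counts and translation directions described above can be checked with a short Python sketch. It assumes that the 67.5k figure is the total across the three source languages, that each language contributes roughly 22.5k source words, and that the three source languages correspond to the codes eng, cmn and arb from the Language ID(s) field. The function and variable names are illustrative only and are not part of the corpus distribution.

```python
from itertools import permutations

# Assumed mapping of the three source languages to ISO 639-3 codes
# listed in the catalog entry (illustrative, not a corpus file format).
LANGUAGES = ["eng", "cmn", "arb"]

# Approximate source words per language, as stated in the description.
WORDS_PER_LANGUAGE = 22_500


def translation_directions(languages):
    """Return every ordered (source, target) pair of distinct languages."""
    return list(permutations(languages, 2))


def total_source_words(languages, words_per_language):
    """Approximate source-text total summed over all languages."""
    return len(languages) * words_per_language


directions = translation_directions(LANGUAGES)
total = total_source_words(LANGUAGES, WORDS_PER_LANGUAGE)

print(len(directions), "translation directions")
print(total, "approximate source words in total")
```

Under these assumptions, three source languages each translated into the other two give six translation directions, and three sets of about 22.5k words account for the 67.5k total.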

Samples

Please view this LDC2009T10 sample and LDC2009T11 sample.
