Catalan TimeBank 1.0

Item Name: Catalan TimeBank 1.0
Author(s): Roser Sauri, Toni Badia
LDC Catalog No.: LDC2012T10
ISBN: 1-58563-618-5
ISLRN: 442-580-062-511-9
Release Date: July 18, 2012
Member Year(s): 2012
DCMI Type(s): Text
Data Source(s): news magazine, newswire
Application(s): temporal analysis, information extraction
Language(s): Catalan
Language ID(s): cat
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2012T10 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Sauri, Roser, and Toni Badia. Catalan TimeBank 1.0 LDC2012T10. Web Download. Philadelphia: Linguistic Data Consortium, 2012.

Catalan TimeBank 1.0 was developed by researchers at Barcelona Media and consists of Catalan texts in the AnCora corpus annotated with temporal and event information according to the TimeML specification language.

TimeML (Pustejovsky et al., 2005) is a schema for annotating eventualities and time expressions in natural language, as well as the temporal relations among them, thus facilitating the extraction, representation and exchange of temporal information. Catalan TimeBank 1.0 is annotated on three levels, marking events, time expressions and event metadata. The TimeML annotation scheme was tailored to the specifics of the Catalan language. Temporal relations in Catalan present distinctions of verbal mood (e.g., indicative, subjunctive, conditional) and grammatical aspect (e.g., imperfective) which are absent in English. Catalan TimeBank 1.0 joins the family of TimeBank annotated corpora, which includes languages such as English, Spanish, Italian, French, Korean and Chinese. Through their common layer of annotation, these corpora provide resources useful for multilingual temporal extraction and processing, such as multilingual textual entailment, opinion mining or question answering.
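
As a rough illustration of what event and time-expression markup in the TimeML style looks like, the following sketch parses a small inline fragment with Python's standard library. The fragment is hypothetical: it is an invented English sentence, not taken from Catalan TimeBank, and it uses inline tags, whereas this release distributes stand-off annotations whose file layout is not described here. EVENT and TIMEX3 and the attributes shown (eid, class, tid, type, value) are TimeML names; the function name summarize is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical TimeML-style fragment (invented English sentence, inline tags).
# It is NOT drawn from Catalan TimeBank, which uses stand-off annotation.
SAMPLE = (
    '<TimeML>'
    'The company <EVENT eid="e1" class="OCCURRENCE">announced</EVENT> '
    'the merger on <TIMEX3 tid="t1" type="DATE" value="2000-03-14">'
    'March 14, 2000</TIMEX3>.'
    '</TimeML>'
)


def summarize(timeml_text):
    """Return (events, times) found in a TimeML-style XML string.

    events: list of (eid, class, surface text)
    times:  list of (tid, type, normalized value, surface text)
    """
    root = ET.fromstring(timeml_text)
    events = [
        (e.get("eid"), e.get("class"), e.text)
        for e in root.iter("EVENT")
    ]
    times = [
        (t.get("tid"), t.get("type"), t.get("value"), t.text)
        for t in root.iter("TIMEX3")
    ]
    return events, times


if __name__ == "__main__":
    events, times = summarize(SAMPLE)
    print("events:", events)
    print("times:", times)
```

A reader working with the actual release would need to consult the accompanying documentation for the stand-off file format before adapting a sketch like this.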

LDC has released the following corpora incorporating TimeBank annotation: TimeBank 1.2 LDC2006T08, FactBank 1.0 LDC2009T23 and ModeS TimeBank 1.0 LDC2012T01.

Data

Catalan TimeBank 1.0 contains stand-off annotations for 210 documents with over 75,800 tokens (including punctuation marks) and 68,000 tokens (excluding punctuation). The source documents are from the EFE news agency, the ACN Catalan news agency and the Catalan version of the El Periódico newspaper, and span the period from January to December 2000.

The AnCora corpus is the largest multilayer annotated corpus of Spanish and Catalan. AnCora contains 400,000 words in Spanish and 275,000 words in Catalan. The AnCora documents are annotated on many linguistic levels including structure, syntax, dependencies, semantics and pragmatics. That information is not included in this release, but it can be mapped to the present annotations. The data contained in the AnCora corpus has been used in several international natural language processing evaluations such as CoNLL-2006, CoNLL-2007 and SemEval-2007. The corpus is freely available from the Centre de Llenguatge i Computació (CLiC).

Updates

Additional information, updates and bug fixes may be available in the LDC catalog entry for this corpus at LDC2012T10.
