FactBank 1.0

Item Name: FactBank 1.0
Authors: Roser Sauri, James Pustejovsky
LDC Catalog No.: LDC2009T23
ISBN: 1-58563-522-7
Release Date: Sep 15, 2009
Data Type: text
Data Source(s): newswire
Application(s): question-answering, temporal analysis
Language(s): English
Language ID(s): eng
Distribution: Web Download
Member fee: $0 for 2009 members
Non-member Fee: US $0.00
Reduced-License Fee: US $0.00
Extra-Copy Fee: N/A
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Roser Sauri, James Pustejovsky
FactBank 1.0
Linguistic Data Consortium, Philadelphia


FactBank 1.0, Linguistic Data Consortium (LDC) catalog number LDC2009T23 and ISBN 1-58563-522-7, consists of 208 documents (over 77,000 tokens) from newswire and broadcast news reports in which event mentions are annotated with their degree of factuality, that is, the degree to which they are presented as corresponding to actual situations in the world. FactBank 1.0 was built on top of TimeBank 1.2 and a fragment of the AQUAINT TimeML Corpus, both of which are annotated using the TimeML specification language. The result is a double-layered annotation of event factuality: TimeBank 1.2 and AQUAINT TimeML encode most of the basic structural elements that express factuality information, while FactBank 1.0 represents the resulting factuality interpretation. Combining the factuality values in FactBank with the structural information in the TimeML-annotated corpora facilitates the development of tools for automatically identifying the factuality values of events, a fundamental component of tasks requiring some degree of text understanding, such as Textual Entailment, Question Answering, or Narrative Understanding.

FactBank annotations indicate whether an event mention describes an actual situation in the world, a situation that has not happened, or a situation of uncertain status. Event factuality is not an inherent feature of events but a matter of perspective: different discourse participants may present divergent views about the factuality of the very same event. Consequently, in FactBank, the factuality degree of each event is assigned relative to the relevant sources at play. In this way, the annotation can adequately reflect the divergence of opinions regarding the factual status of events that is common in news reports.

The annotation language is grounded in established linguistic analyses of the phenomenon, which facilitated the creation of a battery of discriminatory tests for distinguishing between factuality values. Furthermore, the annotation procedure was carefully designed and divided into basic, sequential annotation tasks. This made it possible to build harder tasks on top of simpler ones, while allowing annotators to become familiar with the complexity of the problem incrementally. As a result, FactBank annotation achieved a relatively high inter-annotator agreement (kappa = 0.81), a positive result when compared with similar annotation efforts.


All FactBank markup is standoff and is represented as a set of 20 tables that can be easily loaded into a database. Each table resides in its own text file, in which fields are separated by three consecutive bars (i.e., |||). Values in fields of string type are enclosed in single quotation marks (').
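
As an illustration of the table format just described, the following is a minimal Python sketch for reading one table file. It is not part of the release. The function names and the example row are hypothetical, and the sketch rests on several assumptions not stated in this description: each line holds one row, the ||| separator never occurs inside a field value, unquoted fields that look like integers are numeric, and no unescaping is needed inside quoted strings. The documentation shipped with the corpus should be consulted for the actual column layout of each table.

```python
from typing import List, Union

FIELD_SEP = "|||"  # three consecutive bars, as described above

Field = Union[str, int]


def parse_field(raw: str) -> Field:
    """Convert one raw field into a Python value.

    Assumption: string fields are wrapped in one pair of single quotes,
    and anything unquoted that parses as an integer is numeric.
    """
    raw = raw.strip()
    if len(raw) >= 2 and raw[0] == "'" and raw[-1] == "'":
        # Drop only the outer quotes; the inner text is left untouched.
        return raw[1:-1]
    try:
        return int(raw)
    except ValueError:
        # Unknown shape: keep the raw text rather than guess.
        return raw


def parse_row(line: str) -> List[Field]:
    """Split one line of a table file into typed fields."""
    return [parse_field(f) for f in line.rstrip("\r\n").split(FIELD_SEP)]


def load_table(path: str, encoding: str = "utf-8") -> List[List[Field]]:
    """Read a whole table file, skipping blank lines.

    The encoding is an assumption; adjust it if the files differ.
    """
    with open(path, encoding=encoding) as handle:
        return [parse_row(line) for line in handle if line.strip()]
```

For example, `parse_row("'wsj_0006.tml'|||3|||'e1'")` would yield a three-element list containing a file name, an integer, and an identifier. The values in this row are invented for illustration. Rows parsed this way can then be passed to a database driver's bulk-insert call once a schema matching the table's columns has been defined.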

Because FactBank 1.0 was built on top of TimeBank 1.2 and AQUAINT TimeML, both of which are marked up with inline XML-based annotation, this release contains the TimeBank 1.2 and AQUAINT TimeML annotation in standoff, table-based format as well.


Content Copyright

Portions © 1998 American Broadcasting Corporation, © 1998 The Associated Press, © 1998 Cable News Network, LP, LLLP, © 1987-1989 Dow Jones & Company, Inc., © 1998 New York Times, © 1998 Public Radio International, © 1998, 1999 Xinhua News Agency, © 2002-2009 Brandeis University, © 2009 Trustees of the University of Pennsylvania

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.