Annotated English Gigaword


Item Name: Annotated English Gigaword
Authors: Courtney Napoles, Matthew R. Gormley, Benjamin Van Durme
LDC Catalog No.: LDC2012T21
ISBN: 1-58563-629-0
Release Date: Nov 15, 2012
Data Type: text
Data Source(s): newswire
Project(s): GALE
Application(s): information extraction, information retrieval, language modeling, parsing
Language(s): English
Language ID(s): eng
Distribution: 1 Hard Disk
Member Fee: $0 for 2012 members
Non-member Fee: US $6000.00
Reduced-License Fee: US $3000.00
Extra-Copy Fee: US $200.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Courtney Napoles, Matthew R. Gormley, Benjamin Van Durme. 2012. Annotated English Gigaword. Linguistic Data Consortium, Philadelphia.

Introduction

Annotated English Gigaword was developed by Johns Hopkins University's Human Language Technology Center of Excellence. It adds automatically generated syntactic and discourse structure annotation to English Gigaword Fifth Edition (LDC2011T07) and also contains an API and tools for reading the dataset's XML files. The goal of the annotation is to provide a standardized corpus for knowledge extraction and distributional semantics, enabling broader involvement by researchers in large-scale knowledge-acquisition efforts.

Data

Annotated English Gigaword contains the nearly ten million documents (over four billion words) of the original English Gigaword Fifth Edition from seven news sources:

  • Agence France-Presse, English Service (afp_eng)
  • Associated Press Worldstream, English Service (apw_eng)
  • Central News Agency of Taiwan, English Service (cna_eng)
  • Los Angeles Times/Washington Post Newswire Service (ltw_eng)
  • Washington Post/Bloomberg Newswire Service (wpb_eng)
  • New York Times Newswire Service (nyt_eng)
  • Xinhua News Agency, English Service (xin_eng)

The following layers of annotation were added:

  • Tokenized and segmented sentences
  • Treebank-style constituent parse trees
  • Syntactic dependency trees
  • Named entities
  • In-document coreference chains

The annotation was performed in a three-step process: (1) the data was preprocessed and sentences were selected for annotation (sentences with more than 100 tokens were excluded); (2) syntactic parses were derived; and (3) the parsed output was post-processed to derive syntactic dependencies, named entities, and coreference chains. Over 183 million sentences were parsed.
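The sentence-selection rule in step (1) can be illustrated with a minimal sketch. It assumes whitespace tokenization purely for illustration; the tokenizer actually used for the corpus is not described here, and the function name `select_for_parsing` is hypothetical.

```python
MAX_TOKENS = 100  # threshold stated in the description above


def select_for_parsing(sentences, max_tokens=MAX_TOKENS):
    """Keep sentences with at most max_tokens tokens.

    Assumption: tokens are approximated by whitespace splitting,
    which is only a stand-in for the corpus's real tokenizer.
    """
    return [s for s in sentences if len(s.split()) <= max_tokens]


short_sentence = "The parser handles this sentence ."
long_sentence = " ".join(["token"] * 101)  # 101 tokens, so it is excluded

kept = select_for_parsing([short_sentence, long_sentence])
print(len(kept))
```

A sentence with exactly 100 tokens is kept under this reading, since only sentences with more than 100 tokens are excluded.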

The data is stored in a form similar to the original Gigaword SGML format, with XML annotations carrying the additional markup. The included API provides object representations of the contents of the XML files.
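As a rough illustration of reading per-sentence annotation from XML of this kind, the sketch below parses a small hand-written fragment with Python's standard library. The element names (`sentence`, `token`, `word`, `POS`, `NER`, `parse`), the document ID, and the function `read_sentences` are assumptions made for this example; consult the corpus documentation for the actual schema, and note that this is not the included API.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element names and the document ID are illustrative only.
SAMPLE = """
<DOC id="EXAMPLE_ENG_20100101.0001" type="story">
  <sentences>
    <sentence id="1">
      <tokens>
        <token id="1"><word>Xinhua</word><POS>NNP</POS><NER>ORGANIZATION</NER></token>
        <token id="2"><word>reported</word><POS>VBD</POS><NER>O</NER></token>
      </tokens>
      <parse>(ROOT (S (NP (NNP Xinhua)) (VP (VBD reported))))</parse>
    </sentence>
  </sentences>
</DOC>
"""


def read_sentences(xml_text):
    """Return one dict per sentence with its tokens and parse string."""
    root = ET.fromstring(xml_text)
    result = []
    for sent in root.iter("sentence"):
        # Each token becomes a (word, part-of-speech, entity label) triple.
        tokens = [
            (tok.findtext("word"), tok.findtext("POS"), tok.findtext("NER"))
            for tok in sent.iter("token")
        ]
        result.append(
            {"id": sent.get("id"), "tokens": tokens, "parse": sent.findtext("parse")}
        )
    return result


for sentence in read_sentences(SAMPLE):
    print(sentence["id"], sentence["tokens"])
```

Loading a whole file into memory this way would not scale to the full corpus; a streaming parser such as `xml.etree.ElementTree.iterparse` would be a more realistic choice for files of that size.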

Samples

Please see the link for a sample.

Additional Licensing Information

Any 2011 member organization that licensed English Gigaword Fifth Edition (LDC2011T07) may request a no-cost copy of Annotated English Gigaword. Any non-member organization that licensed English Gigaword Fifth Edition may request a copy of Annotated English Gigaword for a $200 media fee. Please contact ldc@ldc.upenn.edu for licensing or with any additional questions.

Updates

None at this time.

Content Copyright

Portions © 1994-2010 Agence France Presse, © 1994-2010 The Associated Press, © 1997-2010 Central News Agency (Taiwan), © 1994-1998, 2003-2009 Los Angeles Times-Washington Post News Service, Inc., © 1994-2010 New York Times, © 2010 The Washington Post News Service with Bloomberg News, © 1995-2010 Xinhua News Agency, © 2012 Matthew R. Gormley, © 2003, 2005, 2007, 2009, 2011, 2012 Trustees of the University of Pennsylvania