OntoNotes Release 4.0

Item Name: OntoNotes Release 4.0
Author(s): Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, Ann Houston
LDC Catalog No.: LDC2011T03
ISBN: 1-58563-574-X
ISLRN: 272-858-321-100-4
Release Date: February 15, 2011
Member Year(s): 2011
DCMI Type(s): Text
Data Source(s): weblogs, newswire, newsgroups, broadcast news, broadcast conversation
Project(s): GALE
Application(s): information retrieval, information extraction
Language(s): English, Mandarin Chinese, Arabic, Chinese
Language ID(s): eng, cmn, ara, zho
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2011T03 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Weischedel, Ralph, et al. OntoNotes Release 4.0 LDC2011T03. DVD. Philadelphia: Linguistic Data Consortium, 2011.

Introduction

OntoNotes Release 4.0, Linguistic Data Consortium (LDC) catalog number LDC2011T03 and isbn 1-58563-574-X, was developed as part of the OntoNotes project, a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania and the University of Southern Californias Information Sciences Institute. The goal of the project is to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference). OntoNotes Release 4.0 is supported by the Defense Advance Research Project Agency, GALE Program Contract No. HR0011-06-C-0022.

OntoNotes Release 4.0 contains the content of earlier releases -- OntoNotes Release 1.0 LDC2007T21, OntoNotes Release 2.0 LDC2008T04 and OntoNotes Release 3.0 LDC2009T24 -- and adds newswire, broadcast news, broadcast conversation and web data in English and Chinese and newswire data in Arabic. This cumulative publication consists of 2.4 million words as follows: 300k words of Arabic newswire 250k words of Chinese newswire, 250k words of Chinese broadcast news, 150k words of Chinese broadcast conversation and 150k words of Chinese web text and 600k words of English newswire, 200k word of English broadcast news, 200k words of English broadcast conversation and 300k words of English web text.

The OntoNotes project builds on two time-tested resources, following the Penn Treebank for syntax and the Penn PropBank for predicate-argument structure. Its semantic representation will include word sense disambiguation for nouns and verbs, with each word sense connected to an ontology, and coreference. The current goals call for annotation of over a million words each of English and Chinese, and half a million words of Arabic over five years. 

Data

Documents describing the annotation guidelines and the routines for deriving various views of the data from the database are included in the documentation directory of this release. The annotation is provided both in separate text files for each annotation layer (Treebank, PropBank, word sense, etc.) and in the form of an integrated relational database (ontonotes-v4.0.sql.gz) with a Python API to provide convenient cross-layer access.

Tools

This release includes OntoNotes DB Tool v0.999 beta, the tool used to assemble the database from the original annotation files. It can be found in the directory ontonotes-db-tool-v0.999b. This tool can be used to derive various views of the data from the database, and it provides an API that can implement new queries or views. Licensing information for the OntoNotes DB Tool package is included in its source directory.

Updates

On May 21st, 2013 an update was issued to fix some bracketing errors in the follolwing file (ontonotes-release-4.0/data/files/data/english/annotations/nw/wsj/05/wsj_0560.parse), all corpora ordered after this date will include the update. Please contact ldc@ldc.upenn.edu for more information or to obtain the updated file.

Sponsorship

This work is supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-003. The content of this publication does not necessarily reflect the position or policy of the Government, and no official endorsement should be inferred.

Samples

 

Available Media

View Fees

Member
Non-Member
Reduced-License
Extra Copy
Login for the applicable fee