BLLIP North American News Text, General Release


Item Name: BLLIP North American News Text, General Release
Authors: David McClosky, Eugene Charniak, and Mark Johnson
LDC Catalog No.: LDC2008T14
ISBN: 1-58563-482-4
Release Date: Aug 19, 2008
Data Type: text
Data Source(s): newswire
Application(s): language modeling, linguistic analysis, machine learning, natural language processing
Language(s): English
Language ID(s): eng
Distribution: 4 DVDs
Member Fee: $0 for 2008 members
Non-member Fee: US $500.00
Reduced-License Fee: US $500.00
Extra-Copy Fee: US $500.00
Non-member License: yes
Member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: David McClosky, Eugene Charniak, and Mark Johnson. 2008. BLLIP North American News Text, General Release. Linguistic Data Consortium, Philadelphia.

Introduction

Brown Laboratory for Linguistic Information Processing (BLLIP) North American News Text, General Release, LDC2008T14, ISBN 1-58563-482-4, contains a Penn Treebank-style parsing of approximately 21 million sentences from the North American News Text Corpus (LDC95T21). The North American News Text Corpus consists of English news text from the Los Angeles Times-Washington Post (1994-1997), the New York Times (1994-1996), Reuters News Service (1994-1996) and the Wall Street Journal (1994-1996).

BLLIP North American News Text is released in two versions: BLLIP North American News Text, Complete (LDC2008T13), a members-only corpus that contains sentences from all sources in the North American News Text Corpus; and BLLIP North American News Text, General Release (LDC2008T14), a corpus available to nonmembers that does not include the Wall Street Journal data from the North American News Text Corpus.

To complement the Complete and General Release versions of BLLIP North American News Text, LDC is re-releasing the North American News Text Corpus in two versions. North American News Text, Complete (LDC2008T15), the members-only original version, is now available as a 2008 Membership Year corpus. North American News Text, General Release (LDC2008T16), which does not include news text from the Wall Street Journal, is available to nonmembers for the first time. The directory structure of each of these publications has been reorganized to be identical to the directory structure of the BLLIP releases.

Methodology

A key problem in natural language processing is syntactic ambiguity resulting from uncertain relationships between words and their connections to sentence clauses. A sentence that can be analyzed with correct syntax in more than one way is ambiguous, and such a sentence yields multiple parse trees when it is separated into clauses by parts of speech.

Traditional parsing techniques, such as part-of-speech (POS) tagging, typically achieve a 90% accuracy rate because most sentences are not ambiguous. Resolving ambiguous sentences requires a probabilistic approach. Using the relative frequencies of grammar rules, statistical processing techniques assign a probability to each clause. These probabilities are then combined over each complete sentence parse (multiplied, or equivalently summed as log probabilities) to give a probability for that parse. In that way, the most likely parse can be determined.
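The scoring idea in the preceding paragraph can be illustrated with a toy probabilistic context-free grammar. The following is a minimal sketch, not the parser used for this release: the rule probabilities, the example sentence ("She saw the man with the telescope") and the names RULE_PROBS, CANDIDATES and parse_log_prob are all hypothetical. Lexical rules are omitted because they are identical in both candidate parses.

```python
import math

# Hypothetical rule probabilities for a toy grammar (not estimated from the corpus).
# Probabilities for each left-hand side sum to 1.
RULE_PROBS = {
    "S -> NP VP": 1.0,
    "VP -> V NP": 0.6,
    "VP -> VP PP": 0.4,
    "NP -> Pro": 0.2,
    "NP -> Det N": 0.6,
    "NP -> NP PP": 0.2,
    "PP -> P NP": 1.0,
}


def parse_log_prob(rules):
    """Log probability of a parse: the sum of the log probabilities of its rules."""
    return sum(math.log(RULE_PROBS[rule]) for rule in rules)


# Two candidate analyses of the same ambiguous sentence, each listed as the
# grammar rules it uses. They differ only in where the PP attaches.
CANDIDATES = {
    "vp_attachment": [
        "S -> NP VP", "NP -> Pro", "VP -> VP PP", "VP -> V NP",
        "NP -> Det N", "PP -> P NP", "NP -> Det N",
    ],
    "np_attachment": [
        "S -> NP VP", "NP -> Pro", "VP -> V NP", "NP -> NP PP",
        "NP -> Det N", "PP -> P NP", "NP -> Det N",
    ],
}

scores = {name: parse_log_prob(rules) for name, rules in CANDIDATES.items()}
best = max(scores, key=scores.get)  # the most likely parse under this toy grammar
print(best, scores[best])
```

Working in log space avoids numerical underflow on long sentences, which is one reason parser scores are usually reported as log probabilities.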

The data in this release was parsed into Penn Treebank-style parse trees using a re-ranking parser developed by Eugene Charniak and Mark Johnson. The Charniak and Johnson parser is statistically based and uses a generative first stage followed by a discriminative second stage. Both stages were trained on the Wall Street Journal data in Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42). BLLIP 1987-1989 WSJ Corpus Release 1 (LDC2000T43) contains a complete Treebank-style parsing of that Wall Street Journal material.

To produce BLLIP North American News Text, the Charniak-Johnson parser used a simplified context-free grammar in the first stage to generate a set of n-best parses. Those parses were then pruned by eliminating the parses at the edges of the distribution. In the second stage, a maximum-entropy-based parser using a complete grammar was applied. The output trees are ranked in order of probability.
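The two-stage flow can be sketched in a few lines. This is only an illustration of the general n-best-then-rerank pattern, assuming a linear scoring model in the second stage; the candidate records, feature names and weights are hypothetical and are not the features or weights of the Charniak-Johnson parser.

```python
def rerank(candidates, weights, n):
    """Keep the n candidates with the highest first-stage log probability,
    then reorder them by a second-stage linear score over their features."""
    # Stage 1: n-best selection by generative log probability.
    n_best = sorted(candidates, key=lambda c: c["log_prob"], reverse=True)[:n]

    # Stage 2: discriminative score = weighted sum of feature values.
    def score(candidate):
        return sum(weights.get(name, 0.0) * value
                   for name, value in candidate["features"].items())

    return sorted(n_best, key=score, reverse=True)


# Hypothetical candidates: an id, a first-stage log probability, and features.
candidates = [
    {"id": "A", "log_prob": -10.0, "features": {"right_branching": 1, "coord_parallel": 0}},
    {"id": "B", "log_prob": -11.0, "features": {"right_branching": 0, "coord_parallel": 1}},
    {"id": "C", "log_prob": -12.0, "features": {"right_branching": 1, "coord_parallel": 1}},
    {"id": "D", "log_prob": -20.0, "features": {"right_branching": 1, "coord_parallel": 1}},
]
weights = {"right_branching": 0.5, "coord_parallel": 1.5}  # hypothetical weights

ranked = rerank(candidates, weights, n=3)
print([c["id"] for c in ranked])
```

The second stage can reorder the survivors of the first stage, so the final ranking need not follow the first-stage log probabilities. Candidate D never reaches the second stage because it falls outside the n-best set.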

Data

The parses in BLLIP North American News Text include constituency and POS tagging information for each of the 50-best parses of each sentence.

Each file contains a sequence of n-best lists. An n-best list is a list of the top n parses of each sentence with the corresponding parser probability and re-ranker score. Following is an example of a simple n-best list:

50 reute9406_007.0356_13
4.9244 -147.337
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency) (, ,) (NN government) (CC and) (NN parliament))))))))))) (. .)))
3.56482 -151.575
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency)))) (, ,) (NP (NN government) (CC and) (NN parliament)))))))))) (. .)))
3.35952 -151.173
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (ADVP (RB first)) (VP (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (NP (DT the) (NN presidency)) (, ,) (NP (NN government) (CC and) (NN parliament)))))))))))) (. .)))
2.67662 -148.374
(S1 (S (NP (PRP He)) (VP (VBD argued) (SBAR (S (NP (DT the) (NN country)) (VP (ADVP (RB first)) (AUX had) (S (VP (TO to) (VP (VB define) (NP (NP (DT the) (NNS institutions)) (PP (IN of) (NP (DT the) (NN presidency) (, ,) (NN government) (CC and) (NN parliament))))))))))) (. .)))

In the above example, the first number ("50") indicates the number of parses. The next token is the article ID from the North American News Text Corpus ("reute9406_007.0356"), followed by an underscore, followed by the number of the sentence in the article ("13"). The parses follow; for brevity, only four parses out of the fifty are presented here. Each parse consists of a reranker score (4.9244 for the first parse) and parser log probability (-147.337 for the first parse), a new line, and then the parse tree itself. Parse trees are given in Penn Treebank format. Note that the n-best list is sorted by decreasing reranker scores.
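A reader for this layout might look like the sketch below. It assumes what the description above states (a header line with the parse count and sentence identifier, then for each parse a line with the two scores followed by a line with the tree) and additionally assumes that fields are whitespace-separated and that blank lines can be ignored. The names Parse, NBestList and read_nbest_lists are illustrative, and the trees in SAMPLE are shortened for the example rather than copied from the corpus.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Parse(NamedTuple):
    reranker_score: float
    log_prob: float
    tree: str  # Penn Treebank-style bracketing, kept as a string


class NBestList(NamedTuple):
    article_id: str
    sentence_no: int
    parses: List[Parse]


def read_nbest_lists(lines: Iterable[str]) -> Iterator[NBestList]:
    """Yield one NBestList per header line found in `lines`.

    Assumes well-formed input; a truncated list will raise an error.
    """
    # Strip whitespace and drop blank lines (assumed to be insignificant).
    stream = (ln.strip() for ln in lines)
    stream = (ln for ln in stream if ln)
    for header in stream:
        count_str, sentence_id = header.split()
        # The sentence number follows the last underscore of the identifier.
        article_id, _, sentence_no = sentence_id.rpartition("_")
        parses = []
        for _ in range(int(count_str)):
            score_str, log_prob_str = next(stream).split()
            tree = next(stream)
            parses.append(Parse(float(score_str), float(log_prob_str), tree))
        yield NBestList(article_id, int(sentence_no), parses)


# Abbreviated sample: two parses instead of fifty, with shortened trees.
SAMPLE = """\
2 reute9406_007.0356_13
4.9244 -147.337
(S1 (S (NP (PRP He)) (VP (VBD argued)) (. .)))
3.56482 -151.575
(S1 (S (NP (PRP He)) (VP (VBD argued))) (. .))
"""

nbest_lists = list(read_nbest_lists(SAMPLE.splitlines()))
top = nbest_lists[0].parses[0]  # lists are sorted by decreasing reranker score
print(nbest_lists[0].article_id, nbest_lists[0].sentence_no, top.reranker_score)
```

Because the function is a generator over lines, it can be fed an open file object directly, which avoids loading an entire data file into memory.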

Source material is as follows:

Source                                                 Dates      Approx. # Words (millions)
Los Angeles Times & Washington Post                    1994-1997  52
New York Times                                         1994-1996  173
Reuters (General and Financial)                        1994-1996  85
Wall Street Journal (Not included in General Release)  1994-1996  40

Content Copyright

Portions © 1994-1997 Los Angeles Times-Washington Post News Service, Inc., © 1994-1996 New York Times, © 1994-1996 Reuters America, Inc., © 1995-1997, 2008 Trustees of the University of Pennsylvania