English Web Treebank


Item Name: English Web Treebank
Authors: Ann Bies, Justin Mott, Colin Warner, Seth Kulick
LDC Catalog No.: LDC2012T13
ISBN: 1-58563-621-5
Release Date: Aug 16, 2012
Data Type: text
Data Source(s): email, newsgroups, question-answers, reviews, weblogs
Project(s): GALE
Application(s): natural language processing, parsing, part of speech tagging, question-answering
Language(s): English
Language ID(s): eng
Distribution: Web Download
Member fee: $0 for 2012 members
Non-member Fee: US $175.00
Reduced-License Fee: US $175.00
Extra-Copy Fee: US $175.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Ann Bies, et al. 2012. English Web Treebank. Linguistic Data Consortium, Philadelphia.

Introduction

English Web Treebank was developed by the Linguistic Data Consortium (LDC) with funding through a gift from Google Inc. It consists of over 250,000 words of English weblogs, newsgroups, email, reviews and question-answers manually annotated for syntactic structure and is designed to allow language technology researchers to develop and evaluate the robustness of parsing methods in those web domains.

Data

This release contains 254,830 word-level tokens and 16,624 sentence-level tokens of webtext in 1,174 files annotated for sentence- and word-level tokenization, part-of-speech, and syntactic structure. The data is roughly evenly divided across five genres: weblogs, newsgroups, email, reviews, and question-answers. The files were manually annotated following the sentence-level tokenization guidelines for web text and the word-level tokenization guidelines developed for English treebanks in the DARPA GALE project. Only text from the subject line and message body of posts, articles, messages and question-answers was collected and annotated.
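As a rough illustration of what these totals imply, the following minimal sketch derives a few per-unit averages from the figures quoted above. The function and variable names are hypothetical and not part of the release, and the per-genre figure assumes an exactly even split, whereas the release only states that the split is roughly even.

```python
# Totals quoted in the release description above.
WORD_TOKENS = 254_830
SENTENCES = 16_624
FILES = 1_174
GENRES = ["weblogs", "newsgroups", "email", "reviews", "question-answers"]


def corpus_averages(words: int, sentences: int, files: int, genres: int) -> dict:
    """Return simple per-unit averages derived from the corpus totals."""
    return {
        "words_per_sentence": words / sentences,
        "words_per_file": words / files,
        "sentences_per_file": sentences / files,
        # Assumption: an exactly even split; the real split is only roughly even.
        "words_per_genre_if_even": words / genres,
    }


averages = corpus_averages(WORD_TOKENS, SENTENCES, FILES, len(GENRES))
for name, value in averages.items():
    print(f"{name}: {value:.1f}")
```

With these totals, the averages come to roughly 15 words per sentence and about 217 words per file, which suggests that the annotated documents are short.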

Weblogs are interactive web sites that display content as discrete entries or posts and allow viewers to comment on entries and engage in discussions. They are typically managed by individuals and use informal or colloquial language. The weblog data in this release was collected by LDC and covers the period 2003-2006.

Newsgroups are repositories of online discussions pertaining to a topic or interest area. They consist of threads that in turn contain articles with comments and discussion from group users. The newsgroup data in this release was collected by LDC and covers the period 2003-2006.

Email messages are sent to discrete individuals or well-defined groups via the TCP/IP Simple Mail Transfer Protocol (SMTP). The email messages in this corpus are a subset of emails sent by Enron Corporation employees during the period 1999-2002. Specifically, those messages are contained in the Enronsent Corpus, a collection of 96,107 email messages from the sent folders of Enron email users, which were processed to remove any content not generated by human users.

The reviews in this corpus were drawn from online reviews of businesses and services, written by individuals, on various Google web sites. This information was provided to LDC by Google in 2011; the dates of individual reviews are not available.

Question-answers are posts from Yahoo!'s community-driven question-answering web site, Yahoo! Answers, where individuals submit and answer questions on any topic. This data was collected in 2011; the dates of individual question-answers were not collected.

Content Copyright

Portions © 2012 Google Inc., © 2011 Yahoo! Inc., © 2012 Trustees of the University of Pennsylvania