English Web Treebank

Item Name: English Web Treebank
Author(s): Ann Bies, Justin Mott, Colin Warner, Seth Kulick
LDC Catalog No.: LDC2012T13
ISBN: 1-58563-621-5
ISLRN: 230-396-178-102-3
Release Date: August 16, 2012
Member Year(s): 2012
DCMI Type(s): Text
Data Source(s): weblogs, reviews, question-answers, newsgroups, email
Project(s): GALE
Application(s): part of speech tagging, parsing, natural language processing, question-answering
Language(s): English
Language ID(s): eng
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2012T13 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Bies, Ann, et al. English Web Treebank LDC2012T13. Web Download. Philadelphia: Linguistic Data Consortium, 2012.

Introduction

English Web Treebank was developed by the Linguistic Data Consortium (LDC) with funding through a gift from Google Inc. It consists of over 250,000 words of English weblogs, newsgroups, email, reviews and question-answers manually annotated for syntactic structure and is designed to allow language technology researchers to develop and evaluate the robustness of parsing methods in those web domains.

Data

This release contains 254,830 word-level tokens and 16,624 sentence-level tokens of webtext in 1,174 files annotated for sentence- and word-level tokenization, part-of-speech, and syntactic structure. The data is roughly evenly divided across five genres: weblogs, newsgroups, email, reviews, and question-answers. The files were manually annotated following the sentence-level tokenization guidelines for web text and the word-level tokenization guidelines developed for English treebanks in the DARPA GALE project. Only text from the subject line and message body of posts, articles, messages and question-answers was collected and annotated.
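As a rough illustration of how word-level tokens and part-of-speech tags might be read from annotated files, the following Python sketch extracts (word, tag) pairs from a bracketed tree string. It assumes the trees use Penn Treebank-style bracketing, which this description does not state explicitly. The sample tree is invented for illustration and is not taken from the corpus. The function names and the choice to skip empty elements tagged -NONE- when counting are likewise assumptions, not the corpus's documented counting method.

```python
import re

# Assumption: trees are stored as Penn Treebank-style bracketed strings.
# This sample tree is invented for illustration; it is not corpus data.
SAMPLE_TREE = "( (S (NP-SBJ (PRP I)) (VP (VBD loved) (NP (DT the) (NN food))) (. .)) )"

# A leaf (preterminal) has the shape "(TAG word)" with no nested parentheses.
LEAF_PATTERN = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def tagged_tokens(tree: str) -> list[tuple[str, str]]:
    """Return (word, tag) pairs for every leaf in a bracketed tree string."""
    return [(word, tag) for tag, word in LEAF_PATTERN.findall(tree)]


def count_word_tokens(trees: list[str]) -> int:
    """Count leaves across trees, skipping empty elements tagged -NONE-.

    Skipping -NONE- is an assumption about how tokens should be counted.
    """
    return sum(
        1
        for tree in trees
        for _, tag in tagged_tokens(tree)
        if tag != "-NONE-"
    )


if __name__ == "__main__":
    for word, tag in tagged_tokens(SAMPLE_TREE):
        print(f"{word}\t{tag}")
    print("tokens:", count_word_tokens([SAMPLE_TREE]))
```

A regular expression is enough here because only the leaves are needed; recovering the full syntactic structure would require a proper bracket parser.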

Weblogs are interactive web sites that display content as discrete entries or posts and allow viewers to comment on entries and engage in discussions. They are typically managed by individuals and use informal or colloquial language. The weblog data in this release was collected by LDC and covers the period 2003-2006.

Newsgroups are repositories of online discussions pertaining to a topic or interest area. They consist of threads that in turn contain articles with comments and discussion from group users. The newsgroup data in this release was collected by LDC and covers the period 2003-2006.

Emails are messages sent to discrete individuals or well-defined groups via the TCP/IP Simple Mail Transfer Protocol (SMTP). The email messages in this corpus are a subset of emails sent by Enron Corporation employees during the period 1999-2002. Specifically, those messages are contained in the Enronsent Corpus, a collection of 96,107 email messages from the sent folders of Enron email users that were processed to remove any content not generated by human users.

The reviews in this corpus were gleaned from online reviews of businesses and services written by individuals on various Google web sites. This information was provided to LDC by Google in 2011; the dates of individual reviews are not available.

Question-answers are posts from Yahoo!'s community-driven question-answering web site, Yahoo! Answers, where individuals submit and answer questions on any topic. This data was collected in 2011; the dates of individual question-answers were not collected.
