English Web Treebank
Item Name: | English Web Treebank |
Author(s): | Ann Bies, Justin Mott, Colin Warner, Seth Kulick |
LDC Catalog No.: | LDC2012T13 |
ISBN: | 1-58563-621-5 |
ISLRN: | 230-396-178-102-3 |
DOI: | https://doi.org/10.35111/m5b6-4m82 |
Release Date: | August 16, 2012 |
Member Year(s): | 2012 |
DCMI Type(s): | Text |
Data Source(s): | weblogs, reviews, question-answers, newsgroups, email |
Project(s): | GALE |
Application(s): | part of speech tagging, parsing, question-answering |
Language(s): | English |
Language ID(s): | eng |
License(s): | LDC User Agreement for Non-Members |
Online Documentation: | LDC2012T13 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Bies, Ann, et al. English Web Treebank LDC2012T13. Web Download. Philadelphia: Linguistic Data Consortium, 2012. |
Introduction
English Web Treebank was developed by the Linguistic Data Consortium (LDC) with funding through a gift from Google Inc. It consists of over 250,000 words of English weblogs, newsgroups, email, reviews and question-answers manually annotated for syntactic structure and is designed to allow language technology researchers to develop and evaluate the robustness of parsing methods in those web domains.
Data
This release contains 254,830 word-level tokens and 16,624 sentence-level tokens of web text in 1,174 files annotated for sentence- and word-level tokenization, part of speech, and syntactic structure. The data is roughly evenly divided across five genres: weblogs, newsgroups, email, reviews, and question-answers. The files were manually annotated following the sentence-level tokenization guidelines for web text and the word-level tokenization guidelines developed for English treebanks in the DARPA GALE project. Only text from the subject line and message body of posts, articles, messages and question-answers was collected and annotated.
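As a rough guide to the scale of the release, the counts above can be turned into simple averages. The following is a minimal sketch that uses only the figures quoted in this description. The function name `corpus_averages` and the perfectly even split across genres are illustrative assumptions; the text states only that the division is roughly even.

```python
# Counts quoted from the release description.
WORD_TOKENS = 254_830
SENTENCES = 16_624
FILES = 1_174
GENRES = 5  # weblogs, newsgroups, email, reviews, question-answers


def corpus_averages(words: int, sentences: int, files: int, genres: int) -> dict:
    """Return rough per-sentence, per-file, and per-genre averages.

    Hypothetical helper for illustration; not part of the corpus tooling.
    """
    return {
        "words_per_sentence": words / sentences,
        "sentences_per_file": sentences / files,
        "words_per_file": words / files,
        # Assumes a perfectly even split; the description says "roughly".
        "words_per_genre": words / genres,
    }


averages = corpus_averages(WORD_TOKENS, SENTENCES, FILES, GENRES)
for name, value in averages.items():
    print(f"{name}: {value:.1f}")
```

On these figures, sentences average a little over 15 word-level tokens, and each genre would hold about 51,000 tokens under the even-split assumption.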
Weblogs are interactive web sites that display content as discrete entries or posts and allow viewers to comment on entries and engage in discussions. They are typically managed by individuals and use informal or colloquial language. The weblog data in this release was collected by LDC and covers the period 2003-2006.
Newsgroups are repositories of online discussions pertaining to a topic or interest area. They consist of threads that in turn contain articles with comments and discussion from group users. The newsgroup data in this release was collected by LDC and covers the period 2003-2006.
Emails are messages sent to discrete individuals or well-defined groups via the TCP/IP Simple Mail Transfer Protocol (SMTP). The email messages in this corpus are a subset of emails sent by Enron Corporation employees during the period 1999-2002. Specifically, those messages are contained in the Enronsent Corpus, a collection of 96,107 email messages from the sent folders of Enron email users, which were processed to remove any content not generated by human users.
The reviews in this corpus were gleaned from online reviews of businesses and services written by individuals on various Google web sites. Google provided this data to LDC in 2011; the dates of individual reviews are not available.
Question-answers are posts from Yahoo!'s community-driven question-answering web site, Yahoo! Answers, where individuals submit and answer questions on any topic. This data was collected in 2011; the dates of individual question-answers were not collected.