Concretely Annotated New York Times
Item Name: | Concretely Annotated New York Times |
Author(s): | Francis Ferraro, Max Thomas, Travis Wolfe, Matthew R. Gormley, Craig Harman, Benjamin Van Durme |
LDC Catalog No.: | LDC2018T12 |
ISBN: | 1-58563-840-4 |
ISLRN: | 504-151-596-424-6 |
DOI: | https://doi.org/10.35111/xgs8-5140 |
Release Date: | April 16, 2018 |
Member Year(s): | 2018 |
DCMI Type(s): | Text |
Data Source(s): | newswire |
Application(s): | coreference resolution, event detection, information extraction, information retrieval, language modeling, parsing |
Language(s): | English |
Language ID(s): | eng |
License(s): | Concretely Annotated New York Times Agreement |
Online Documentation: | LDC2018T12 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Ferraro, Francis, et al. Concretely Annotated New York Times LDC2018T12. Web Download. Philadelphia: Linguistic Data Consortium, 2018. |
Introduction
Concretely Annotated New York Times was developed by Johns Hopkins University's Human Language Technology Center of Excellence. It adds multiple kinds and instances of automatically generated syntactic, semantic, and coreference annotations to The New York Times Annotated Corpus (LDC2008T19).
Concrete is a schema for representing structured, hierarchical, and overlapping linguistic annotations. Where several tools produce the same annotation type, this release stores each tool's output as a separate annotation theory over a shared tokenization.
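The idea of several annotation theories over one shared tokenization can be illustrated with a minimal Python sketch. This is not the Concrete schema itself: the class, method, and tool names below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Tokenization:
    """Hypothetical stand-in: one token sequence shared by all annotations."""
    tokens: List[str]
    # taggings[tagging_type][tool_name] -> one tag per token
    taggings: Dict[str, Dict[str, List[str]]] = field(default_factory=dict)
    # dependency_parses[tool_name] -> (head_index, dependent_index, label) edges
    dependency_parses: Dict[str, List[Tuple[int, int, str]]] = field(default_factory=dict)

    def add_tagging(self, tagging_type: str, tool: str, tags: List[str]) -> None:
        """Store one tool's tags; they must align with the shared tokens."""
        if len(tags) != len(self.tokens):
            raise ValueError("tagging must have exactly one tag per token")
        self.taggings.setdefault(tagging_type, {})[tool] = list(tags)

    def add_dependency_parse(self, tool: str, edges: List[Tuple[int, int, str]]) -> None:
        """Store one tool's dependency edges; a head index of -1 marks the root."""
        n = len(self.tokens)
        for head, dependent, _label in edges:
            if not (-1 <= head < n and 0 <= dependent < n):
                raise ValueError("edge refers to a token outside the tokenization")
        self.dependency_parses[tool] = list(edges)

    def theories(self, tagging_type: str) -> List[str]:
        """Names of the tools that supplied a tagging of the given type."""
        return sorted(self.taggings.get(tagging_type, {}))


# Two hypothetical taggers disagree on one token but index the same tokens.
sentence = Tokenization(tokens=["The", "Times", "reported", "it", "."])
sentence.add_tagging("POS", "tagger_a", ["DT", "NNP", "VBD", "PRP", "."])
sentence.add_tagging("POS", "tagger_b", ["DT", "NNPS", "VBD", "PRP", "."])
print(sentence.theories("POS"))
```

Because every theory indexes into the same token list, outputs from different tools can be compared token by token without realignment.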
Data
Concretely Annotated New York Times contains all of the 1.8 million articles in The New York Times Annotated Corpus. Those articles were written and published by the New York Times between January 1, 1987 and June 19, 2007. The original corpus (LDC2008T19) also includes metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com.
The following layers of annotation were added by processing the articles under the Concrete schema:
- Segmented sentences and Penn Treebank-style tokenized words
- Treebank-style constituent parse trees
- Four different syntactic dependency trees
- Named entities
- Part of speech tags
- Lemmas
- In-document entity coreference chains
- Three different frame semantic parses
See analytics.pdf for the list of tools used to create those annotations.
The data is stored in Concrete, a binary format based on Apache Thrift. Concrete can be read and written in many common programming languages, such as Java, Python, JavaScript and C++. Concrete also includes a number of utilities to access and view the data in human-readable forms.
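As an illustration of reading the data from Python, the sketch below assumes the separately installed concrete-python package and its `read_communication_from_file` helper. The field names (`sectionList`, `sentenceList`, `tokenization`, `tokenTaggingList`, `taggingType`, `metadata.tool`) follow the Concrete Thrift schema as understood here and should be checked against the installed version. The function names are hypothetical.

```python
from collections import defaultdict


def load_communication(path):
    """Read one Concrete Communication file (requires concrete-python)."""
    # Imported lazily so the helper below also works without the package.
    from concrete.util import read_communication_from_file
    return read_communication_from_file(path)


def tools_by_tagging_type(comm):
    """Map each token tagging type (e.g. POS, NER) to the tools that produced it.

    Assumes `comm` exposes the Concrete Communication fields named in the
    lead-in; list fields that are unset (None) are treated as empty.
    """
    tools = defaultdict(set)
    for section in comm.sectionList or []:
        for sentence in section.sentenceList or []:
            tokenization = sentence.tokenization
            if tokenization is None:
                continue
            for tagging in tokenization.tokenTaggingList or []:
                tools[tagging.taggingType].add(tagging.metadata.tool)
    return {kind: sorted(names) for kind, names in tools.items()}
```

A call such as `tools_by_tagging_type(load_communication("example.comm"))`, where `example.comm` is a placeholder path, would list which tools contributed each token-level layer in that document. See analytics.pdf for the actual tools.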
The original NITF (News Industry Text Format) document structure in The New York Times Annotated Corpus was preserved in this Concrete version.
Samples
Please view this concrete sample.
Reference
Users of this corpus must cite the following paper:
Francis Ferraro, Max Thomas, Matthew Gormley, Travis Wolfe, Craig Harman, and Benjamin Van Durme. "Concretely Annotated Corpora." In Proceedings of the NIPS Workshop on Automated Knowledge Base Construction (AKBC), 2014.
Additional Licensing Instructions
Any organization that licensed The New York Times Annotated Corpus (LDC2008T19) may request a copy of Concretely Annotated New York Times (LDC2018T12) for a $150 fee. Contact ldc@ldc.upenn.edu for licensing.