News Sub-domain Named Entity Recognition

Item Name: News Sub-domain Named Entity Recognition
Author(s): Oshin Agarwal, Ani Nenkova
LDC Catalog No.: LDC2023T12
ISLRN: 999-199-891-265-9
DOI: https://doi.org/10.35111/msjw-9j39
Release Date: November 15, 2023
Member Year(s): 2023
DCMI Type(s): Text
Data Source(s): newswire
Application(s): information extraction, information retrieval, named entity recognition
Language(s): English
Language ID(s): eng
License(s): News Sub-domain Named Entity Recognition
Online Documentation: LDC2023T12 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Agarwal, Oshin, and Ani Nenkova. News Sub-domain Named Entity Recognition LDC2023T12. Web Download. Philadelphia: Linguistic Data Consortium, 2023.

Introduction

News Sub-domain Named Entity Recognition (LDC2023T12) was developed at the University of Pennsylvania and contains over 20,000 English news sentences annotated with named entities and categorized into sub-domains. The sentences were extracted from The New York Times Annotated Corpus (LDC2008T19), which comprises over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007. Article metadata in that corpus was provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com.

Data

Sentences were selected from different years and topics using the metadata provided in The New York Times Annotated Corpus. Named entity annotation was based on the CoNLL-2003 guidelines and annotation scheme. Sentences were labeled with person (PER), location (LOC) and organization (ORG) tags using phrase matching followed by a manual second pass. The sub-domains are Arts (+Weekend/Cultural), Business (+Financial), Classifieds (+Obituary), Editorial, Foreign, Metropolitan, Sports and Others. "Others" includes topics such as Real Estate, New Jersey Weekly, Book Review, Job Market, Science, and Health & Fitness.

Each line in the annotation files (except the document ID line) contains two columns separated by a tab: the first column contains the word, and the second column contains the tag. Following CoNLL guidelines, tags are B-TYPE, I-TYPE and O, where TYPE is PER, LOC or ORG.
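
The two-column layout described above can be read with a few lines of Python. The following is a minimal sketch, not part of the release: the function names read_sentences and extract_entities are hypothetical, and it assumes that any line without a tab (a blank line or a document ID line) marks a sentence boundary, which may not match the actual files. The sample lines are invented for illustration and are not taken from the corpus.

```python
from typing import Iterable, List, Optional, Tuple

Token = Tuple[str, str]  # (word, tag)


def read_sentences(lines: Iterable[str]) -> List[List[Token]]:
    """Group tab-separated word/tag lines into sentences.

    Assumption: a line without a tab (blank line or document ID)
    ends the current sentence. The real files may be laid out differently.
    """
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if "\t" not in line:
            if current:
                sentences.append(current)
                current = []
            continue
        word, tag = line.split("\t", 1)
        current.append((word, tag.strip()))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: List[Token]) -> List[Tuple[str, str]]:
    """Collapse B-TYPE/I-TYPE runs into (entity text, TYPE) pairs."""
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    etype: Optional[str] = None
    for word, tag in sentence:
        continues = tag.startswith("I-") and tag[2:] == etype
        if continues:
            words.append(word)
            continue
        # Any other tag closes the entity collected so far.
        if words and etype is not None:
            entities.append((" ".join(words), etype))
        if tag.startswith(("B-", "I-")):
            # Lenient choice: an I- tag with no matching B- starts a new entity.
            words, etype = [word], tag[2:]
        else:
            words, etype = [], None
    if words and etype is not None:
        entities.append((" ".join(words), etype))
    return entities


# Invented sample in the described format (not corpus data).
sample = [
    "doc-0001\n",
    "Ani\tB-PER\n",
    "Nenkova\tI-PER\n",
    "works\tO\n",
    "in\tO\n",
    "Philadelphia\tB-LOC\n",
    "\n",
    "Penn\tB-ORG\n",
]
sentences = read_sentences(sample)
entities = extract_entities(sentences[0])
```

For real use, pass an open file object to read_sentences in place of the sample list. Treating a stray I-TYPE tag as the start of a new entity is a design choice made here for robustness; a stricter reader could raise an error instead.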

Annotation and source text files are presented in txt format.

Samples

TXT file

Updates

None at this time.
