Message Understanding Conference 7 Timed (MUC7_T)


Item Name: Message Understanding Conference 7 Timed (MUC7_T)
Authors: Katrin Tomanek, Udo Hahn
LDC Catalog No.: LDC2010T15
ISBN: 1-58563-560-X
Release Date: Sep 17, 2010
Data Type: text
Data Source(s): newswire
Project(s): MUC
Application(s): information extraction
Language(s): English
Distribution: Web Download
Member Fee: $0 for 2010 members
Non-member Fee: US $150.00
Reduced-License Fee: US $150.00
Extra-Copy Fee: N/A
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Katrin Tomanek, Udo Hahn
2010
Message Understanding Conference 7 Timed (MUC7_T)
Linguistic Data Consortium, Philadelphia

Introduction

Message Understanding Conference 7 Timed (MUC7_T), Linguistic Data Consortium (LDC) catalog number LDC2010T15 and ISBN 1-58563-560-X, was developed by researchers at the Jena University Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Germany. It is a re-annotation of a portion of the MUC7 corpus (Linguistic Data Consortium, LDC2001T02), which consists of New York Times news stories annotated for use in the Message Understanding Conference 7 (MUC7) evaluation. The series of MUC evaluations in the 1990s focused on emerging information extraction technologies. Further information about NIST's MUC7 evaluation can be found on the MUC project website.

MUC7_T consists of 100 articles from the MUC7 corpus training set re-annotated for named entities (persons, locations and organizations), with a time stamp recording the time measured for each linguistic decision. The corpus was developed for two principal purposes: for use in evaluations of selective sampling strategies, such as Active Learning; and to create predictive models for annotation costs. The annotation was performed by two advanced students of linguistics with good English language skills who followed the original guidelines of the MUC7 named entity task (which can be found in the online documentation for the MUC7 corpus).

Data

The data is stored in XML format. There is an element anno_example for each annotation example, which has the original MUC7 document as text context. The MUC7 document was tokenized using the Stanford Tokenizer, with white spaces marking token boundaries. The tokenizer is part of the Stanford Parser package, which can be obtained from The Stanford Natural Language Processing Group. The following attributes are used for the element anno_example:

Attribute  Explanation
anno_time  The time it took to annotate the annotation unit of this annotation example (time in milliseconds).
anno_unit_tokens  All tokens of the annotation unit.
anno_unit_offset  Offsets for the tokens of the annotation unit relative to all tokens in the annotation example.
anno_unit_labels  Labels for the tokens of the annotation unit (these labels are taken from MUC7).
doc_id  ID of the document of the annotation example.
sent_id  ID of the sentence of the annotation example.
anno_unit_id  ID of the unit of the annotation example.
muc7_org_filename  The name of the original MUC7 document from which this annotation example is taken.
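
As an illustration of how these attributes might be read, the following is a minimal Python sketch using only the standard library. It is not the Java API shipped in the tools subdirectory. Several details are assumptions made for the example and are not confirmed by this description: the root element name (examples), the whitespace-separated encoding of tokens, offsets and labels, zero-based offsets, the placement of the text context as element text, and every attribute value in the sample. The sample itself is invented and is not taken from the corpus; the DTD in the dtd subdirectory is the authoritative reference.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

# Invented sample, NOT corpus data. The root element <examples>, all values,
# and the whitespace-separated, zero-based encoding are assumptions.
SAMPLE = """<examples>
  <anno_example anno_time="2150"
                anno_unit_tokens="New York"
                anno_unit_offset="3 4"
                anno_unit_labels="LOCATION LOCATION"
                doc_id="1" sent_id="2" anno_unit_id="0"
                muc7_org_filename="example.sgm">He flew to New York on Monday .</anno_example>
</examples>"""


@dataclass
class AnnoExample:
    anno_time_ms: int          # anno_time, in milliseconds
    unit_tokens: List[str]     # anno_unit_tokens
    unit_offsets: List[int]    # anno_unit_offset (assumed zero-based)
    unit_labels: List[str]     # anno_unit_labels
    doc_id: str
    sent_id: str
    anno_unit_id: str
    muc7_org_filename: str
    context_tokens: List[str]  # text context (assumed to be the element text)


def parse_examples(xml_text: str) -> List[AnnoExample]:
    """Turn every anno_example element into an AnnoExample record."""
    root = ET.fromstring(xml_text)
    examples = []
    for el in root.iter("anno_example"):
        a = el.attrib
        examples.append(AnnoExample(
            anno_time_ms=int(a["anno_time"]),
            unit_tokens=a["anno_unit_tokens"].split(),
            unit_offsets=[int(o) for o in a["anno_unit_offset"].split()],
            unit_labels=a["anno_unit_labels"].split(),
            doc_id=a["doc_id"],
            sent_id=a["sent_id"],
            anno_unit_id=a["anno_unit_id"],
            muc7_org_filename=a["muc7_org_filename"],
            # Tokens are separated by white space, so split() recovers them.
            context_tokens=(el.text or "").split(),
        ))
    return examples


if __name__ == "__main__":
    for ex in parse_examples(SAMPLE):
        per_token_ms = ex.anno_time_ms / len(ex.unit_tokens)
        print(ex.unit_tokens, ex.unit_labels, per_token_ms)
```

Under the same assumptions, a record of this kind holds what a simple annotation cost model would need: the measured time (anno_time) together with the tokens and labels of the annotation unit.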

Directory Structure

The directory structure of the corpus is as follows:

data:  This subdirectory contains the MUC7_T data; the data for annotators A and B are in separate folders. For each annotator, there is a version of MUC7_T with CNP-level and with sentence-level annotations.
docs:  This subdirectory contains detailed documentation as well as publications describing applications of MUC7_T. There is also a small JavaDoc for the Java tools (see the tools subdirectory below).
dtd:  This subdirectory contains the Document Type Definition (DTD) for the data files.
tools:  This subdirectory contains a small Java API which allows users to read the MUC7_T XML data so that each annotation example is represented by a Java object. The API includes the source code and a jar package. The source code has been tested with Java 1.5 and Java 1.6.

Updates

Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus at LDC2010T15.

Samples

The following XML excerpts are representative of the data in this corpus:

Content Copyright

Portions © 1996 New York Times, © 2001, 2010 Trustees of the University of Pennsylvania