Message Understanding Conference 7 Timed (MUC7_T)
|Item Name:||Message Understanding Conference 7 Timed (MUC7_T)|
|Author(s):||Katrin Tomanek, Udo Hahn|
|LDC Catalog No.:||LDC2010T15|
|Release Date:||September 17, 2010|
|License(s):||LDC User Agreement for Non-Members|
|Online Documentation:||LDC2010T15 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Tomanek, Katrin, and Udo Hahn. Message Understanding Conference 7 Timed (MUC7_T) LDC2010T15. Web Download. Philadelphia: Linguistic Data Consortium, 2010.|
Message Understanding Conference 7 Timed (MUC7_T), Linguistic Data Consortium (LDC) catalog number LDC2010T15 and ISBN 1-58563-560-X, was developed by researchers at the Jena University Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Germany. It is a re-annotation of a portion of the MUC7 corpus (Linguistic Data Consortium, LDC2001T02), which consists of New York Times news stories annotated for use in the Message Understanding Conference 7 (MUC7) evaluation. The series of MUC evaluations in the 1990s focused on emerging information extraction technologies. Further information about NIST's MUC7 evaluation can be found at the MUC project website.
MUC7_T consists of 100 articles from the MUC7 corpus training set re-annotated for named entities (persons, locations and organizations) with a time stamp indicating the time measured for the linguistic decision-making process. The corpus was developed for two principal purposes: for use in evaluations of selective sampling strategies, such as Active Learning; and to create predictive models for annotation costs. The annotation was performed by two advanced students of linguistics with good English language skills who followed the original guidelines of the MUC7 named entity task (which can be found in the online documentation for the MUC7 corpus).
The data is stored in XML format. There is an element anno_example for each annotation example that has the original MUC7 document as text context. The MUC7 document was tokenized using the Stanford Tokenizer, with white spaces marking token boundaries. The tokenizer is part of the Stanford Parser package, which can be obtained from The Stanford Natural Language Processing Group. The following attributes are used for the element anno_example:
|anno_time||The time it took to annotate the annotation unit of this annotation example (time in milliseconds).|
|anno_unit_tokens||All tokens of the annotation unit.|
|anno_unit_offset||Offsets for the tokens of the annotation unit relative to all tokens in the annotation example.|
|anno_unit_labels||Labels for the tokens of the annotation unit (these labels are taken from MUC7).|
|doc_id||ID of the document of the annotation example.|
|sent_id||ID of the sentence of the annotation example.|
|anno_unit_id||ID of the unit of the annotation example.|
|muc7_org_filename||The name of the original MUC7 document from which this annotation example is taken.|
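Given the attributes listed above, reading the annotation examples requires only a standard XML parser. The following is a minimal sketch using Python's standard library; the inline sample document, its nesting under a single root element, and the concrete attribute values (including the label strings) are illustrative assumptions, not excerpts from the corpus.

```python
# Sketch: reading MUC7_T anno_example elements with Python's stdlib XML parser.
# Element and attribute names follow the documentation above; the sample
# document below is a fabricated placeholder, not real corpus data.
import xml.etree.ElementTree as ET

sample = """<corpus>
  <anno_example anno_time="2300"
                anno_unit_tokens="New York Times"
                anno_unit_offset="4 5 6"
                anno_unit_labels="ORGANIZATION ORGANIZATION ORGANIZATION"
                doc_id="12" sent_id="3" anno_unit_id="1"
                muc7_org_filename="example.sgm">... New York Times ...</anno_example>
</corpus>"""

root = ET.fromstring(sample)
for ex in root.iter("anno_example"):
    anno_time_ms = int(ex.get("anno_time"))      # annotation time in milliseconds
    tokens = ex.get("anno_unit_tokens").split()  # white space marks token boundaries
    labels = ex.get("anno_unit_labels").split()  # one label per token
    offsets = [int(o) for o in ex.get("anno_unit_offset").split()]
    print(ex.get("doc_id"), ex.get("sent_id"), anno_time_ms, tokens, labels, offsets)
```

In a real workflow, `ET.parse()` on one of the data files would replace the inline string, and the per-example annotation times could be aggregated to model annotation cost.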
The directory structure of the corpus is as follows:
- data: This subdirectory contains the MUC7_T data; the data for annotators A and B are in separate folders. For each annotator, there is a version of MUC7_T with CNP-level annotations and one with sentence-level annotations.
- docs: This subdirectory contains detailed documentation as well as publications describing applications of MUC7_T. There is also a small JavaDoc for the Java tools (see the tools subdirectory below).
- dtd: This subdirectory contains the Document Type Definition (DTD) for the data files.
- tools: This subdirectory contains a small Java API which allows users to read the MUC7_T XML data so that each annotation example is represented by a Java object. The API includes the source code and a jar package. The source code has been tested with Java 1.5 and Java 1.6.
Additional information, updates, bug fixes may be available in the LDC catalog entry for this corpus at LDC2010T15.
The following XML excerpts are representative of the data in this corpus: