PennBioIE CYP 1.0
|Item Name:||PennBioIE CYP 1.0|
|Author(s):||Mark Liberman, Mark Mandel, GlaxoSmithKline Pharmaceuticals R&D|
|LDC Catalog No.:||LDC2008T20|
|Release Date:||November 18, 2008|
|Data Source(s):||journal articles|
|Application(s):||information retrieval, information extraction, information detection, automatic content extraction|
|License(s):||LDC User Agreement for Non-Members|
|Online Documentation:||LDC2008T20 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Liberman, Mark, Mark Mandel, and GlaxoSmithKline Pharmaceuticals R&D. PennBioIE CYP 1.0 LDC2008T20. Web Download. Philadelphia: Linguistic Data Consortium, 2008.|
The PennBioIE CYP Corpus consists of 1100 PubMed abstracts on the inhibition of cytochrome P450 enzymes, comprising approximately 274,000 words of biomedical text, tokenized and annotated for paragraph, sentence, part of speech, and five types of biomedical named entities in three categories of interest. 324 of the abstracts have also been syntactically annotated.
All of the annotation was based on Penn Treebank II standards, with some modifications for special characteristics of the biomedical text. The entity definitions were developed and revised in an extensive process of interaction between domain experts and biomedically trained annotators.
The data was prepared by the Linguistic Data Consortium for the Institute for Research in Cognitive Science, with funding from the National Science Foundation under Grant No. ITR EIA-0205448, Information Technology Research (ITR) program, in collaboration with GlaxoSmithKline Pharmaceuticals R&D.
The corpus contains 1100 PubMed abstracts comprising approximately 313,000 total words of text. Each file has been tokenized, and its biomedical portions (274,000 words) have been exhaustively annotated for paragraph, sentence, and part of speech, and non-exhaustively annotated for five types of named entity; every token carries a part-of-speech tag.
Tokens and POS tags: Tokens in biomedical and chemical notation, technical terms, and spelled-out numbers may contain whitespace and/or punctuation ("beta, 20 diol", "(Na+ + K+)ATPase", "two hundred seven"), and a named entity mention may comprise several tokens ("polychlorinated biphenyl preparations"). Tokens and entities do not span sentence boundaries.
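Representing tokens and entity mentions by character offsets into the source text makes whitespace-containing tokens and multi-token mentions unproblematic. The following sketch illustrates the idea; the offsets, span layout, and entity type name are illustrative only and are not taken from the corpus files:

```python
# Illustrative sketch: whitespace-containing tokens and a multi-token
# entity mention represented as character-offset spans over the source
# text. All offsets and the entity type name are made up for this example.

text = "polychlorinated biphenyl preparations inhibit the enzyme"

# Tokens as (start, end) offsets; a token may itself contain whitespace.
tokens = [(0, 15), (16, 24), (25, 37), (38, 45), (46, 49), (50, 56)]

# One entity mention spanning the first three tokens.
entity = {"type": "substance", "start": 0, "end": 37}

# Resolving the mention is a slice of the untouched source text.
mention = text[entity["start"]:entity["end"]]
print(mention)  # -> polychlorinated biphenyl preparations
```

Because spans are offsets rather than copies of the text, a mention that covers several tokens needs no special treatment: it is simply a wider span.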
Biomedical and non-biomedical text: The title and body of each abstract are considered biomedical text, and the automatic and manual annotations in them have been extensively curated. Everything else, such as citation information and author names, is considered non-biomedical; it has not been entity annotated, and its automated tokenization and part-of-speech tags have not been curated and are known to be unreliable. In non-biomedical text, the tag "section" is used instead of "sentence", allowing users to include or exclude these parts. There are approximately 274,000 words of biomedical text and 39,000 words of non-biomedical text.
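Because curated biomedical runs are tagged "sentence" and uncurated material is tagged "section", selecting only the reliable text reduces to filtering on that tag. A minimal sketch, assuming spans are available as (tag, start, end) triples (the triple layout is an assumption, not the corpus file format):

```python
# Hypothetical sketch: curated biomedical text is tagged "sentence",
# uncurated material (citations, author names) is tagged "section".
# Spans are assumed to be (tag, start, end) triples into the source text.

def biomedical_spans(spans):
    """Keep only spans tagged 'sentence', i.e. curated biomedical text."""
    return [s for s in spans if s[0] == "sentence"]

spans = [
    ("section", 0, 42),     # e.g. citation header: not curated
    ("sentence", 43, 120),  # abstract title/body: curated
    ("sentence", 121, 200),
]
print(biomedical_spans(spans))
```

Excluding "section" spans in this way avoids training or evaluating on the tokenization and POS tags the documentation flags as unreliable.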
Principles and Methods
Many annotation projects start with an already annotated corpus, such as the Penn Treebank or the Brown Corpus, which is treated as unchangeable. As a result, annotation practices have sometimes involved compromises which might not have been necessary if the earlier annotation had been able to integrate the requirements of the later work. Such integration is necessary here because of the scope of this project, involving highly technical biomedical texts, entity definitions driven by the needs of biomedical research, and the goal of making the annotation layers work together as much as possible, e.g., using entity information in the treebank annotation of prenominal modifiers. Such integration is also possible given the relatively long term of the grant (five years) and because researchers were starting with fresh text, applying all layers of annotation themselves.
The texts are annotated at the following layers:
- Biomedical entity
- Token and part of speech
- Syntax (treebanking) (some texts only)
- Semantic relations
Paragraph, sentence, tokenization, POS, and syntactic annotation (treebanking) are applied by automatic taggers and manually corrected; entity annotation is manual. The authors originally used a POS tagger trained on Penn Treebank data, which made many errors on the very different text of these biomedical abstracts. Once there was enough manually corrected data to retrain the tagger, overall accuracy rose from 88.53% to 97.33% (Kulick et al. 2004, slides).
Annotation at all layers except entity is based on the Penn Treebank II guidelines, with a number of modifications that have been found necessary, many of which were subsequently adopted by the Penn Treebank. Entity definitions came originally from domain experts and were developed and refined in dialogue with the annotators. All annotation is standoff: the source text is never modified, annotations being made in a separate file.
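The standoff principle described above can be sketched as follows: the source text is read-only, and annotations live in a separate structure that points into it by character offsets. The record fields and label below are hypothetical; the actual file format is documented with the corpus itself:

```python
# Minimal sketch of standoff annotation: the source text is never
# modified; annotations in a separate structure reference it by
# character offsets. The field names and label are hypothetical.

source = "Ketoconazole inhibits CYP3A4."

annotations = [
    {"layer": "entity", "label": "cyp450", "start": 22, "end": 28},
]

# Resolve each annotation against the untouched source text.
for ann in annotations:
    print(ann["label"], "->", source[ann["start"]:ann["end"]])
```

Keeping annotations standoff means new layers (entity, POS, treebank) can be added or revised independently without ever altering the source text they describe.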
For examples of the data contained in this corpus, see the sample page, which includes the source text, the standoff annotations, tokenization, treebank, and an interactive HTML view.