This is release 1.0 of the Oncology corpus of PennBioIE, the Biomedical
Information Extraction Project at the University of Pennsylvania, supported
by award EIA-0205448 from the National Science Foundation's Information
Technology Research program, with assistance in specific areas from The Pew
Charitable Trusts and the David Lawrence Altschuler Chair in Genomics and
Computational Biology.  This release also includes the contents of
release 0.9 (December 2004).

The purpose of this project is to provide material for developing better
methods of information extraction from biomedical free text. To that end,
we have annotated PubMed abstracts in two biomedical domains:

inhibition of the cytochrome P450 family of enzymes:
    name:           CYP450
    short name:     cyp
    abstracts:      1100
    approx. words:  274,000

cancer, concentrating on molecular genetics:
    name:           oncology
    short name:     onco
    abstracts:      1414
    approx. words:  327,000

In addition, 642 abstracts (324 cyp, 318 onco) are syntactically annotated
(treebanked), and 601 abstracts (oncology only) are annotated for relations
between entities that are part of a single genetic variation.

The texts are annotated at the following layers:

 -  Paragraph
 -  Sentence
 -  Biomedical entity
 -  Token and part of speech
 -  Syntax (treebanking) (some texts only)
 -  Semantic relations (some oncology texts only) 


The project and the Oncology corpus are described in more detail in
index.html and in data/data.html.