BioProp Version 1.0

Item Name: BioProp Version 1.0
Author(s): Wen-Lian Hsu
LDC Catalog No.: LDC2009T04
ISBN: 1-58563-504-9
ISLRN: 969-572-383-651-0
Release Date: August 18, 2009
Member Year(s): 2009
DCMI Type(s): Text
Data Source(s): journal articles
Application(s): natural language processing, language modeling
Language(s): English
Language ID(s): eng
License(s): BioProp Version 1.0 Agreement
Online Documentation: LDC2009T04 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Hsu, Wen-Lian. BioProp Version 1.0 LDC2009T04. Web Download. Philadelphia: Linguistic Data Consortium, 2009.

Introduction

BioProp Version 1.0 was developed by researchers at Academia Sinica, Taipei, Taiwan. It consists of proposition bank-style annotations for approximately 500 English biomedical journal abstracts. The source abstracts, annotated in accordance with Penn Treebank II guidelines, are contained in the GENIA Treebank (GTB). The GTB was developed at the Tsujii Laboratory at the University of Tokyo.

The purpose of the GENIA Project is to develop tools and resources for automatic extraction of biomedical information. One result of that work is the GENIA corpus, a collection of 2000 biomedical journal abstracts containing semantic class annotation for biomedical terms, part-of-speech (POS) tags and coreferences. The GTB is a subset of that corpus. BioProp Version 1.0 adds a proposition bank to the GTB.

Proposition Bank (PropBank) contains annotations of predicate argument structures and semantic roles in a treebank schema in the newswire domain. To construct BioProp Version 1.0, a semantic role labeling (SRL) system trained on PropBank was used to annotate the GTB. SRL, also called shallow semantic parsing, is a popular semantic analysis technique. In SRL, sentences are represented by one or more predicate-argument structures (PAS), also known as propositions. Each PAS is composed of a predicate (e.g., a verb) and several arguments (e.g., noun phrases) that have different semantic roles, including main arguments such as agent and patient, and adjunct arguments, such as time, manner and location. The term "argument" refers to a syntactic constituent of the sentence related to the predicate, and the term "semantic role" refers to the semantic relationship between a sentence's predicate and argument.

To suit the needs of the biomedical domain, the PropBank annotation guidelines were modified to characterize semantic roles as components of biological events. Specifically, thirty verbs were selected according to their frequency of use or importance in biomedical texts. Since targets in information extraction are relations of named entities, only sentences containing protein or gene names were used to count each verb's frequency. Verbs of general usage were filtered out in order to keep the focus on biomedical verbs. Some verbs that do not have a high frequency but play important roles in describing biomedical relations, such as "phosphorylate" and "transactivate," were also selected. The BioProp annotation was based on Levin's verb classes as defined in the VerbNet lexicon. In VerbNet, the arguments of each verb are represented at the semantic level, and thus have associated semantic roles. However, since some verbs may have different usages in biomedical and newswire texts, it was necessary to customize the framesets of biomedical verbs. After selecting the predicate verbs, a semi-automatic method was used to annotate BioProp. The annotation process consisted of the following steps:

  • Identification of predicate candidates
  • Automatic annotation of the biomedical semantic roles using a newswire SRL system
  • Transformation of automatic tagging results into WordFreak format
  • Review by human annotators

Data

BioProp Version 1.0 consists of approximately 150,000 words. Each line in the corpus provides a PAS annotation that can be mapped to a sentence in the GTB.

Samples

91079577 4 74:82 induce 0:65-ARG0 74:82-rel 83:99-ARG1 100:113-ARGM-LOC
91094881 3 142:152 stimulate 0:46-ARG0 49:139-ARGM-TMP 142:152-rel 153:166-ARG1 167:217-ARGM-LOC
91094881 6 88:98 stimulate 0:55-ARGM-ADV 58:87-ARG0 88:98-rel 99:112-ARG1 113:168-ARGM-LOC
91094881 8 217:222 bind 160:183-ARG1 184:210-C-ARG1 211:216-R-ARG1 223:247-ARG2 217:222-rel 248:275-ARGM-ADV
91094881 9 45:53 suppress 0:13-ARGM-ADV 16:38-ARG0 54:78-ARG1 39:44-ARGM-MOD 45:53-rel 79:105-C-ARG1 106:135-ARGM-LOC
91094881 10 49:56 block 0:8-ARGM-DIS 11:44-ARG1 49:56-rel 57:82-ARG0 83:115-ARGM-LOC
91101115 2 99:108 increase 0:98-ARG1 99:108-rel 109:152-ARGM-CAU
91101115 3 159:163 bind 119:153-ARG1 164:191-ARG2 154:158-R-ARG1 159:163-rel
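
The sample records above can be read programmatically. The following is a minimal Python sketch, not part of the corpus distribution. It assumes, based only on the shape of the samples, that each record holds an abstract identifier, a sentence number, the predicate span, the predicate verb, and then a list of span-label pairs. The field names (doc_id, sentence, predicate_span, and so on) are hypothetical, and the unit of the start:end offsets (characters or tokens) is not stated here, so the sketch treats them as plain integers. Consult the corpus documentation for the authoritative format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Argument:
    start: int   # first offset of the span (unit assumed, see lead-in)
    end: int     # second offset of the span
    label: str   # e.g. ARG0, ARGM-LOC, C-ARG1, rel


@dataclass
class Proposition:
    doc_id: str                      # assumed: identifier of the source abstract
    sentence: int                    # assumed: sentence number within the abstract
    predicate_span: Tuple[int, int]  # span of the predicate verb
    predicate: str                   # predicate verb, e.g. "bind"
    arguments: List[Argument]


def parse_span(text: str) -> Tuple[int, int]:
    """Turn a 'start:end' string into a pair of integers."""
    start, end = text.split(":")
    return int(start), int(end)


def parse_record(record: str) -> Proposition:
    """Parse one whitespace-separated PAS record."""
    fields = record.split()
    doc_id, sentence, pred_span, predicate = fields[:4]
    arguments = []
    for field in fields[4:]:
        # Split on the first hyphen only, because labels such as
        # ARGM-LOC and C-ARG1 contain hyphens themselves.
        span, label = field.split("-", 1)
        start, end = parse_span(span)
        arguments.append(Argument(start, end, label))
    return Proposition(doc_id, int(sentence), parse_span(pred_span),
                       predicate, arguments)


sample = ("91094881 8 217:222 bind 160:183-ARG1 184:210-C-ARG1 "
          "211:216-R-ARG1 223:247-ARG2 217:222-rel 248:275-ARGM-ADV")
prop = parse_record(sample)
print(prop.predicate, [(a.label, a.start, a.end) for a in prop.arguments])
```

In every sample record the span tagged rel matches the predicate span given in the third field, which offers a simple consistency check when loading the data.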
