BioProp Version 1.0


Item Name: BioProp Version 1.0
Authors: Wen-Lian Hsu
LDC Catalog No.: LDC2009T04
ISBN: 1-58563-504-9
Release Date: Aug 18, 2009
Data Type: text
Data Source(s): journal articles
Application(s): language modeling, natural language processing
Language(s): English
Language ID(s): eng
Distribution: Web Download
Member fee: $0 for 2009 members
Non-member Fee: US $300.00
Reduced-License Fee: US $150.00
Extra-Copy Fee: N/A
Non-member License: yes
Member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Wen-Lian Hsu. 2009. BioProp Version 1.0. Linguistic Data Consortium, Philadelphia.

Introduction

BioProp Version 1.0 was developed by researchers at Academia Sinica, Taipei, Taiwan. It consists of proposition bank-style annotations for approximately 500 English biomedical journal abstracts. The source abstracts, annotated in accordance with Penn Treebank II guidelines, are contained in the GENIA Treebank (GTB). The GTB was developed at the Tsujii Laboratory at the University of Tokyo.

The purpose of the GENIA Project is to develop tools and resources for the automatic extraction of biomedical information. One result of that work is the GENIA corpus, a collection of 2,000 biomedical journal abstracts containing semantic class annotation for biomedical terms, part-of-speech (POS) tags and coreferences. The GTB is a subset of that corpus. BioProp Version 1.0 adds a proposition bank to the GTB.

Proposition Bank (PropBank) contains annotations of predicate argument structures and semantic roles in a treebank schema in the newswire domain. To construct BioProp Version 1.0, a semantic role labeling (SRL) system trained on PropBank was used to annotate the GTB. SRL, also called shallow semantic parsing, is a popular semantic analysis technique. In SRL, sentences are represented by one or more predicate-argument structures (PAS), also known as propositions. Each PAS is composed of a predicate (e.g., a verb) and several arguments (e.g., noun phrases) that have different semantic roles, including main arguments such as agent and patient, and adjunct arguments, such as time, manner and location. The term "argument" refers to a syntactic constituent of the sentence related to the predicate, and the term "semantic role" refers to the semantic relationship between a sentence's predicate and argument.

To suit the needs of the biomedical domain, the PropBank annotation guidelines were modified to characterize semantic roles as components of biological events. Specifically, thirty verbs were selected according to their frequency of use or importance in biomedical texts. Since targets in information extraction are relations of named entities, only sentences containing protein or gene names were used to count each verb's frequency. Verbs of general usage were filtered out in order to keep the focus on biomedical verbs. Some verbs that do not have a high frequency but play important roles in describing biomedical relations, such as "phosphorylate" and "transactivate," were also selected. The BioProp annotation was based on Levin's verb classes as defined in the VerbNet lexicon. In VerbNet, the arguments of each verb are represented at the semantic level, and thus have associated semantic roles. However, since some verbs may have different usages in biomedical and newswire texts, it was necessary to customize the framesets of biomedical verbs. After selecting the predicate verbs, a semi-automatic method was used to annotate BioProp. The annotation process consisted of the following steps:

  • Identification of predicate candidates
  • Automatic annotation of the biomedical semantic roles using a newswire SRL system
  • Transformation of automatic tagging results into WordFreak format
  • Review by human annotators

Data

BioProp Version 1.0 consists of approximately 150,000 words. Each line in the corpus provides a PAS annotation that can be mapped to a sentence in the GTB.

Samples

91079577 4 74:82 induce 0:65-ARG0 74:82-rel 83:99-ARG1 100:113-ARGM-LOC
91094881 3 142:152 stimulate 0:46-ARG0 49:139-ARGM-TMP 142:152-rel 153:166-ARG1 167:217-ARGM-LOC
91094881 6 88:98 stimulate 0:55-ARGM-ADV 58:87-ARG0 88:98-rel 99:112-ARG1 113:168-ARGM-LOC
91094881 8 217:222 bind 160:183-ARG1 184:210-C-ARG1 211:216-R-ARG1 223:247-ARG2 217:222-rel 248:275-ARGM-ADV
91094881 9 45:53 suppress 0:13-ARGM-ADV 16:38-ARG0 54:78-ARG1 39:44-ARGM-MOD 45:53-rel 79:105-C-ARG1 106:135-ARGM-LOC
91094881 10 49:56 block 0:8-ARGM-DIS 11:44-ARG1 49:56-rel 57:82-ARG0 83:115-ARGM-LOC
91101115 2 99:108 increase 0:98-ARG1 99:108-rel 109:152-ARGM-CAU
91101115 3 159:163 bind 119:153-ARG1 164:191-ARG2 154:158-R-ARG1 159:163-rel
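
The sample records above can be read programmatically. The following Python sketch is not part of the corpus distribution. It assumes a field layout inferred from the samples only: an abstract identifier, a sentence number, the predicate span, the predicate lemma, and then a list of span-label pairs, all separated by whitespace. The names Argument, Proposition and parse_line are hypothetical, and the unit of the start:end spans is not stated here, so the code keeps them as plain integers without interpreting them. Consult the online documentation for the authoritative format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Argument:
    start: int
    end: int
    label: str  # e.g. "ARG0", "ARGM-LOC", "C-ARG1", or "rel" for the predicate

    @property
    def is_adjunct(self) -> bool:
        # PropBank-style adjunct roles carry the ARGM prefix; a leading
        # "C-" (continuation) or "R-" (reference) marker is skipped first.
        core = self.label[2:] if self.label[:2] in ("C-", "R-") else self.label
        return core.startswith("ARGM")


@dataclass
class Proposition:
    abstract_id: str
    sentence_no: int
    predicate_span: Tuple[int, int]
    predicate: str
    arguments: List[Argument]


def parse_span(text: str) -> Tuple[int, int]:
    """Turn 'start:end' into a pair of integers."""
    start, end = text.split(":")
    return int(start), int(end)


def parse_line(line: str) -> Proposition:
    """Parse one record, assuming the field order seen in the samples."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError("expected at least four fields: %r" % line)
    abstract_id, sentence_no, pred_span, predicate = fields[:4]
    arguments = []
    for item in fields[4:]:
        # Split on the first hyphen only, so labels such as
        # "ARGM-LOC" and "C-ARG1" stay in one piece.
        span, label = item.split("-", 1)
        start, end = parse_span(span)
        arguments.append(Argument(start, end, label))
    return Proposition(abstract_id, int(sentence_no),
                       parse_span(pred_span), predicate, arguments)


if __name__ == "__main__":
    sample = ("91079577 4 74:82 induce 0:65-ARG0 74:82-rel "
              "83:99-ARG1 100:113-ARGM-LOC")
    prop = parse_line(sample)
    roles = [a.label for a in prop.arguments if a.label != "rel"]
    print(prop.predicate, roles)  # induce ['ARG0', 'ARG1', 'ARGM-LOC']
```

Splitting each pair on the first hyphen is a deliberate choice: it keeps the sketch independent of the exact label inventory, at the cost of not validating the labels themselves.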

Content Copyright

Portions © 2006-2008 Academia Sinica, © 2009 Trustees of the University of Pennsylvania