This README describes the contents of the biomedical proposition bank, BioProp 1.0.

Overview
==============================================================================
Semantic role labeling (SRL), also called shallow semantic parsing, is a popular semantic analysis technique. In SRL, sentences are represented by one or more predicate-argument structures (PAS), also known as propositions. Each PAS is composed of a predicate (e.g., a verb) and several arguments (e.g., noun phrases) that have different semantic roles, including main arguments, such as an agent and a patient, and adjunct arguments, such as time, manner, and location. Here, the term "argument" refers to a syntactic constituent of the sentence related to the predicate, and the term "semantic role" refers to the semantic relationship between a predicate and an argument of a sentence.

For example, the sentence "IL4 and IL13 receptors activate STAT6, STAT3, and STAT5 proteins in the human B cells" describes a molecular activation process. It can be represented by a PAS in which "activate" is the predicate, "IL4 and IL13 receptors" is the agent, "STAT6, STAT3, and STAT5 proteins" is the patient, and "in the human B cells" is the location. Thus, the agent, patient, and location are the arguments of the predicate.

We constructed a biomedical proposition bank on top of the GENIA Treebank. GENIA is a corpus of 2,000 MEDLINE abstracts annotated with various levels of linguistic information, such as parts-of-speech and named entities. The portion of the GENIA corpus annotated with full parsing information is called the GENIA Treebank (GTB) and contains 500 abstracts.

To construct our biomedical proposition bank, we first used the rich resources of PropBank, from the general English domain, to build an SRL system. We then used that system to automatically annotate the semantic roles of sentences in GTB, and corrected the automatic annotations by hand. The result is the biomedical proposition bank BioProp.
It contains 1,962 analyzed PASs for over 30 biomedical verbs that are frequently used or considered important for describing molecular events.

This file is available to those who have a license for the GENIA Treebank Corpus.

Publication Name
==============================================================================
Biomedical proposition bank, BioProp 1.0

Authors
==============================================================================
Wen-Lian Hsu

Data Type
===============================================================================
The data is encoded as a text file that provides SRL annotations for the GENIA Treebank corpus. For more information on the GENIA Treebank corpus, please visit http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA.

Data Sources
===============================================================================
In 2006, we adopted GENIA as the underlying corpus to construct BioProp. GENIA is a collection of 2,000 MEDLINE abstracts selected from the search results for queries using the keywords "human", "blood cells", or "transcription factors". In its officially released version, it is annotated with various levels of linguistic information, such as parts-of-speech, named entities, and conjunctions. In the summer of 2005, Tateisi published full parsing information for the corpus, encoded in XML, that basically follows the Penn Treebank II (PTB) annotation scheme. The GENIA corpus annotated with full parsing information is called the GENIA Treebank (GTB). Currently, GTB is a beta version containing 500 abstracts. BioProp provides the semantic role annotations for GTB.

Project
===============================================================================
The project name is BIOmedical SeMantIc roLe labeler (BIOSMILE).

Applications
===============================================================================
An annotated corpus and a PAS standard are essential for the construction of a biomedical SRL system.
Users can use the BioProp corpus to train machine-learning systems that extract biomedical relations. Such systems can support tasks including information extraction, information retrieval, metadata extraction, summarization, automatic content extraction, and information detection.

Languages
===============================================================================
All articles in this collection are written in English.

Grant Number and Funding Agency
===============================================================================
This research was supported in part by the National Science Council under grants NSC97-3112-B-001-011 and NSC97-2218-E-155-001, as well as the thematic program of Academia Sinica under grant AS95ASIA02.

Copyright
===============================================================================
Wen-Lian Hsu holds the copyright to all data in this corpus.
Portions (c) 2006-2008 Academia Sinica

Data Description
===============================================================================
The BioProp 1.0 file is 147,791 bytes in size and covers a total of 150,000 words. BioProp 1.0 was constructed on top of the GENIA Treebank (GTB) and contains the BioProp annotations only. Users need a license for the GTB corpus to obtain a copy of it.

Each line in BioProp 1.0 provides one PAS annotation, which can be mapped to a sentence in the GENIA Treebank corpus. For example, consider the following data from BioProp 1.0:

91079577 4 74:82 induce 0:65-ARG0 74:82-rel 83:99-ARG1 100:113-ARGM-LOC
91094881 3 142:152 stimulate 0:46-ARG0 49:139-ARGM-TMP 142:152-rel 153:166-ARG1 167:217-ARGM-LOC
91094881 6 88:98 stimulate 0:55-ARGM-ADV 58:87-ARG0 88:98-rel 99:112-ARG1 113:168-ARGM-LOC

These lines illustrate the BioProp 1.0 format. Each line can be separated into four fixed fields followed by one or more semantic role items.

First field: The GENIA Treebank (GTB) file ID.
Second field: The sentence number in the GTB file, starting from 1.
Third field: The verb position in the sentence.
This value is a pointer, in the form described below, that marks where the verb occurs in the corresponding sentence.
Fourth field: The base form of the verb.
Fifth and later fields: Each item represents a semantic role of the BioProp PAS.

Each item consists of a pointer and a semantic role label separated by a hyphen. A pointer consists of two numbers separated by a colon. The first number is the start index within the sentence text in GTB (note that this is the plain sentence text, not the GTB tree format). The second number is the end index. For example, in the sentence "They bind to the kappa B motifs", the word "They" is represented as 0:4. The phrase "to the kappa B motifs" is the ARG2 of the verb "bind" and is represented as "10:31-ARG2". As this example shows, the indices count characters from 0, and the end index points just past the last character of the span.

For a complete description of the data used in this corpus, please refer to the file 'BioProp.pdf' included with this collection.
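
The field layout described above can be read with a few lines of Python. The following is a minimal sketch and is not part of the BioProp distribution. The names Role, Proposition, parse_line, and role_text are illustrative only. The sketch assumes that fields are separated by whitespace and that pointers are 0-based character offsets with an exclusive end, as the "They bind to the kappa B motifs" example suggests.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Role:
    start: int  # start offset in the plain sentence text
    end: int    # end offset (assumed exclusive, per the README example)
    label: str  # e.g. "ARG0", "rel", "ARGM-LOC"


@dataclass
class Proposition:
    file_id: str                # GTB file ID, kept as a string
    sentence_no: int            # sentence number in the GTB file, from 1
    verb_span: Tuple[int, int]  # pointer to the verb
    verb_lemma: str             # base form of the verb
    roles: List[Role]


def parse_pointer(text: str) -> Tuple[int, int]:
    """Turn a 'start:end' pointer into a pair of integers."""
    start, end = text.split(":")
    return int(start), int(end)


def parse_line(line: str) -> Proposition:
    """Parse one BioProp line (assumes whitespace-separated fields)."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields, got {len(fields)}")
    file_id, sentence_no, verb_pos, lemma = fields[:4]
    roles = []
    for item in fields[4:]:
        # Split at the first hyphen only: labels such as ARGM-LOC
        # contain a hyphen of their own.
        pointer, label = item.split("-", 1)
        start, end = parse_pointer(pointer)
        roles.append(Role(start, end, label))
    return Proposition(file_id, int(sentence_no),
                       parse_pointer(verb_pos), lemma, roles)


def role_text(sentence: str, role: Role) -> str:
    """Return the part of the plain sentence text that a role covers."""
    return sentence[role.start:role.end]


if __name__ == "__main__":
    record = parse_line(
        "91079577 4 74:82 induce 0:65-ARG0 74:82-rel "
        "83:99-ARG1 100:113-ARGM-LOC"
    )
    print(record.verb_lemma, [r.label for r in record.roles])

    sentence = "They bind to the kappa B motifs"
    print(role_text(sentence, Role(10, 31, "ARG2")))
```

To recover the text of each argument, the sentence passed to role_text must be the plain sentence text taken from the licensed GTB corpus. BioProp itself stores only the offsets and labels.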