This is the README file for the English Bio TB in CoNLL 2007.

Version: README, v 0.1 2007/01/17

1. Source

   Mining the Bibliome project at the University of Pennsylvania
   See http://bioie.ldc.upenn.edu/

   This data was extracted from the Oncology Treebank portion.

2. CoNLL specifics

 - For the shared task, we sampled sentences until 5000 tokens were
   extracted. Sentences with annotations not conforming to the Penn TB
   guidelines were discarded. In particular, sentences with the HYPH or
   AFX part-of-speech tag were discarded. Traces and gapping annotations
   were modified to conform to standard Penn Treebank II guidelines.
   There are still a few slight differences.
   See http://bioie.ldc.upenn.edu/wiki/index.php/Main_Page for more
   details.

 - The data format adheres to the standard format for all languages.
   See http://depparse.uvt.nl/depparse-wiki/DataFormat

 - The data contains each word, a coarse-grained part-of-speech, a
   fine-grained part-of-speech, the head of each token, and the
   dependency relation.

 - The fine-grained part-of-speech is the gold-standard part-of-speech
   tag from the data.

 - The coarse-grained part-of-speech is just the first two characters
   of the fine-grained part-of-speech.

 - The head and dependency relation fields were converted using the
   algorithms described in

   > Richard Johansson and Pierre Nugues (tentative title)
   > "Extended Constituent-to-Dependency Conversion for English"
   > (Submitted)
   > http://www.lucas.lth.se/lt/pennconverter

   This was run with the arguments: -conjAsHead -prepAsHead

3. Unlabeled data (for domain adaptation track)

   This data is in the folder unlab. It can consist of multiple files.
   If there are multiple files, each file is numbered. Files with
   smaller numbers contain less data, but that data is more relevant to
   the test set. This may be because the tokenization is gold-standard
   (and not automatic), or because the data comes from documents more
   closely related to the test data.

4. Acknowledgements

   Thanks to the LDC and the University of Pennsylvania for providing
   the data. In particular, thanks to the BioIE project for creating
   the data set. Thanks to Richard Johansson for providing the software
   to convert from phrase structure to dependencies.
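As an illustration of the data format described in section 2, the following Python sketch reads sentences and checks that each coarse-grained tag equals the first two characters of the fine-grained tag. It assumes the ten tab-separated columns of the CoNLL shared task format (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with a blank line between sentences. The function names and the sample sentence, including its dependency labels, are made up for illustration and are not taken from the data.

```python
# Column names assumed from the CoNLL shared task data format
# (ten tab-separated fields per token line).
FIELDS = ["id", "form", "lemma", "cpostag", "postag",
          "feats", "head", "deprel", "phead", "pdeprel"]


def read_conll(lines):
    """Yield sentences as lists of token dicts; blank lines end a sentence."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        token = dict(zip(FIELDS, line.split("\t")))
        token["id"] = int(token["id"])
        # HEAD may be "_" in files without gold dependencies.
        head = token.get("head", "_")
        token["head"] = int(head) if head.isdigit() else None
        sentence.append(token)
    if sentence:
        yield sentence


def coarse_tag(postag):
    """Coarse-grained tag: the first two characters of the fine-grained tag."""
    return postag[:2]


def mismatched_coarse_tags(sentence):
    """Return tokens whose CPOSTAG is not the two-character prefix of POSTAG."""
    return [t for t in sentence if t["cpostag"] != coarse_tag(t["postag"])]


# Hypothetical sample sentence (not from the treebank).
SAMPLE_ROWS = [
    ["1", "p53", "_", "NN", "NN", "_", "2", "NMOD", "_", "_"],
    ["2", "mutations", "_", "NN", "NNS", "_", "3", "SBJ", "_", "_"],
    ["3", "occur", "_", "VB", "VBP", "_", "0", "ROOT", "_", "_"],
    ["4", ".", "_", ".", ".", "_", "3", "P", "_", "_"],
]
SAMPLE = "\n".join("\t".join(row) for row in SAMPLE_ROWS) + "\n\n"

if __name__ == "__main__":
    for sent in read_conll(SAMPLE.splitlines()):
        bad = mismatched_coarse_tags(sent)
        print(len(sent), "tokens,", len(bad), "coarse-tag mismatches")
```

To run the same check on a real file, pass an open file handle to read_conll in place of SAMPLE.splitlines().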