This is the README file for the English Penn Treebank in CoNLL 2007.

Version: README, v 0.1 2007/01/17

1. Source

The Penn Treebank

See http://www.cis.upenn.edu/~treebank/
and http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42

Description from the LDC:

  The Penn Treebank (PTB) project selected 2,499 stories from a three
  year Wall Street Journal (WSJ) collection of 98,732 stories for
  syntactic annotation. These 2,499 stories have been distributed in
  both Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB.

2. CoNLL specifics

- For the CoNLL shared task training set we used sections 02-11 of the
  WSJ.
- The data format adheres to the standard format for all languages.
  See http://depparse.uvt.nl/depparse-wiki/DataFormat
- The data contains each word, a coarse-grained part-of-speech, a
  fine-grained part-of-speech, the head of each token and the
  dependency relation.
- The fine-grained part-of-speech is the gold-standard part-of-speech
  tag from the WSJ. Details of the tag set can be found at
  http://bulba.sdsu.edu/jeanette/thesis/PennTags.html or
  http://www.cis.upenn.edu/~treebank/
- The coarse-grained part-of-speech is just the first two characters
  of the fine-grained part-of-speech.
- The head and dependency relation fields were converted using the
  algorithms described in

  > Richard Johansson and Pierre Nugues (tentative title)
  > "Extended Constituent-to-Dependency Conversion for English"
  > (Submitted)
  > http://www.lucas.lth.se/lt/pennconverter

  The converter was run with the arguments:
    -conjAsHead -prepAsHead

3. Unlabeled data (for domain adaptation track)

This data is in the folder unlab. It can consist of multiple files.
If there are multiple files, each file is numbered. Lower-numbered
files contain less data, but data that is more relevant to the test
set, either because the tokenization is gold-standard (rather than
automatic) or because the data comes from documents more closely
related to the test data.

4. Acknowledgements

Thanks to the LDC and the University of Pennsylvania for providing the
data. Thanks to Richard Johansson for providing the software to convert
from phrase structure to dependencies.
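
Example: reading the data

The fields listed under "CoNLL specifics" can be read with a few lines of Python. What follows is only a minimal sketch, not part of the official tooling. It assumes the tab-separated, ten-column layout of the CoNLL shared task format, with blank lines between sentences; check the column order against the DataFormat page before relying on it. The names read_sentences, coarse_tag, COLUMNS and SAMPLE are hypothetical, and the two sample tokens and their relation labels are made up for illustration, not taken from the treebank.

```python
from typing import Dict, Iterable, Iterator, List

# Assumed column order (CoNLL shared task style); verify against the
# DataFormat page referenced in this README.
COLUMNS = ["id", "form", "lemma", "cpostag", "postag",
           "feats", "head", "deprel", "phead", "pdeprel"]


def read_sentences(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        sentence.append(dict(zip(COLUMNS, fields)))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


def coarse_tag(fine_tag: str) -> str:
    """Coarse POS as described above: first two characters of the fine tag."""
    return fine_tag[:2]


# Made-up two-token sample in the assumed layout.
SAMPLE = (
    "1\tStocks\t_\tNN\tNNS\t_\t2\tSBJ\t_\t_\n"
    "2\tfell\t_\tVB\tVBD\t_\t0\tROOT\t_\t_\n"
    "\n"
)

for sent in read_sentences(SAMPLE.splitlines()):
    for tok in sent:
        # The coarse tag column should match the derived coarse tag.
        assert tok["cpostag"] == coarse_tag(tok["postag"])
        print(tok["form"], tok["postag"], tok["head"], tok["deprel"])
```

To read a real file, pass an open file handle instead of SAMPLE.splitlines(). All values are kept as strings, so convert the id and head fields to integers if you need to follow head links.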