This is the README file for the English Penn Treebank in CoNLL 2007

Version: README, v 0.1 2007/01/17

1. Source

The Penn Treebank

See http://www.cis.upenn.edu/~treebank/ and
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42

Description from the LDC:

The Penn Treebank (PTB) project selected 2,499 stories from a three year
Wall Street Journal (WSJ) collection of 98,732 stories for syntactic
annotation. These 2,499 stories have been distributed in both Treebank-2
(LDC95T7) and Treebank-3 (LDC99T42) releases of PTB.


2. CoNLL specifics

- For the CoNLL shared task training set we used sections 02-11 of the WSJ

- The data format adheres to the standard format used for all languages.
  See http://depparse.uvt.nl/depparse-wiki/DataFormat

- The data contains each word, a coarse-grained part-of-speech tag, a
  fine-grained part-of-speech tag, the head of each token, and the
  dependency relation.

- The fine-grained part-of-speech tags are the gold-standard tags from
  the WSJ. Details can be found at
  http://bulba.sdsu.edu/jeanette/thesis/PennTags.html
  or http://www.cis.upenn.edu/~treebank/

- The coarse-grained part-of-speech tag is just the first two characters
  of the fine-grained part-of-speech tag (for example, NNS becomes NN).

- The head and dependency relation fields were converted using the
  algorithms described in

  > Richard Johansson and Pierre Nugues (tentative title)
  > "Extended Constituent-to-Dependency Conversion for English"
  > (Submitted)
  > http://www.lucas.lth.se/lt/pennconverter

  This was run with the arguments: -conjAsHead -prepAsHead
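
The points above can be illustrated with a minimal Python sketch. It
assumes the ten-column, tab-separated layout of the CoNLL-X shared task
format (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD,
PDEPREL) with a blank line between sentences; this README only points to
that format, so check the column order against the DataFormat page. The
names COLUMNS, read_sentences and coarse_tag are hypothetical, and the
sample sentence is made up, not taken from the treebank.

```python
from typing import Dict, Iterable, List

# Column names of the ten-field CoNLL-X layout (an assumption, see above).
COLUMNS = ["id", "form", "lemma", "cpostag", "postag",
           "feats", "head", "deprel", "phead", "pdeprel"]


def read_sentences(lines: Iterable[str]) -> List[List[Dict[str, str]]]:
    """Group tab-separated token lines into sentences.

    A blank line marks a sentence boundary.
    """
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"expected {len(COLUMNS)} fields, got {len(fields)}")
        current.append(dict(zip(COLUMNS, fields)))
    if current:
        sentences.append(current)
    return sentences


def coarse_tag(fine_tag: str) -> str:
    """Coarse-grained tag: the first two characters of the fine-grained tag."""
    return fine_tag[:2]


# Made-up example sentence with illustrative labels.
sample = [
    "1\tDogs\t_\tNN\tNNS\t_\t2\tSBJ\t_\t_\n",
    "2\tbark\t_\tVB\tVBP\t_\t0\tROOT\t_\t_\n",
    "\n",
]

for token in read_sentences(sample)[0]:
    # The coarse tag in the file should match the two-character prefix.
    assert token["cpostag"] == coarse_tag(token["postag"])
    print(token["form"], token["postag"], token["head"], token["deprel"])
```

To read a real file, pass an open file handle to read_sentences instead
of the sample list. A HEAD value of 0 denotes the artificial root in the
CoNLL-X format.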


3. Unlabeled data (for domain adaptation track)

This data is in the folder unlab. It can consist of multiple files.
If there are multiple files, each file is numbered. A file with a
smaller number contains less data, but that data is more relevant to
the test set. This may be because its tokenization is gold-standard
(and not automatic), or because the data comes from documents more
closely related to the test data.
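
As a rough sketch of how the numbering might be used, the Python snippet
below orders file names so that lower-numbered files come first. The
file names and the functions numeric_key and by_relevance are
hypothetical; this README does not specify the naming scheme, so the
pattern (first integer in the name) is an assumption to adapt.

```python
import re
from typing import List


def numeric_key(name: str) -> int:
    """Return the first integer found in a file name (assumed scheme)."""
    match = re.search(r"\d+", name)
    if match is None:
        raise ValueError(f"no number in file name: {name!r}")
    return int(match.group())


def by_relevance(names: List[str]) -> List[str]:
    """Sort so that lower-numbered files come first."""
    return sorted(names, key=numeric_key)


# Hypothetical file names; the real names in unlab may differ.
print(by_relevance(["unlab10.txt", "unlab2.txt", "unlab1.txt"]))
```

Sorting on the parsed integer rather than on the raw string keeps file
10 after file 2.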

4. Acknowledgements

Thanks to the LDC and the University of Pennsylvania for providing the
data. Thanks to Richard Johansson for providing the software to
convert from phrase structure to dependencies.