This is the README file for the English Bio TB in CoNLL 2007.

Version: README, v 0.1 2007/01/17

1. Source

Mining the Bibliome project at the University of Pennsylvania

See http://bioie.ldc.upenn.edu/

This data was extracted from the Oncology Treebank portion.

2. CoNLL specifics

- For the shared task, we sampled sentences until 5000 tokens had been
  extracted. Sentences with annotations not conforming to the
  Penn TB guidelines were discarded. In particular, sentences with
  the HYPH or AFX part-of-speech tag were discarded. Traces and
  gapping annotations were modified to conform to standard
  Penn Treebank II guidelines. There are still a few slight
  differences.
  See http://bioie.ldc.upenn.edu/wiki/index.php/Main_Page
  for more details.

- The data format adheres to the standard format used for all languages.
  See http://depparse.uvt.nl/depparse-wiki/DataFormat

- The data contains each word, a coarse grained part-of-speech, a
  fine grained part-of-speech, the head of each token and the
  dependency relation.

- The fine grained part-of-speech is the gold standard part-of-speech
  tag from the data.

- The coarse grained part-of-speech is just the first two characters of
  the fine grained part-of-speech.

- The head and dependency relation fields were converted using the
  algorithms described in

  > Richard Johansson and Pierre Nugues (tentative title)
  > "Extended Constituent-to-Dependency Conversion for English"
  > (Submitted)
  > http://www.lucas.lth.se/lt/pennconverter

  This was run with the arguments: -conjAsHead -prepAsHead
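
As a rough illustration of the fields listed above, the following Python
sketch reads sentences in the shared-task column format and derives the
coarse grained tag from the fine grained one. It assumes the usual
CoNLL-X layout of ten tab-separated columns per token with a blank line
between sentences. The function names and the three-token sample are
made up for this example and are not taken from the treebank.

```python
from typing import Dict, Iterable, Iterator, List

# Assumed CoNLL-X column order: ten tab-separated fields per token.
COLUMNS = ["id", "form", "lemma", "cpostag", "postag",
           "feats", "head", "deprel", "phead", "pdeprel"]

# Illustrative sample only; these rows are NOT from the treebank.
SAMPLE_ROWS = [
    ["1", "Cells", "_", "NN", "NNS", "_", "2", "SBJ", "_", "_"],
    ["2", "divide", "_", "VB", "VBP", "_", "0", "ROOT", "_", "_"],
    ["3", ".", "_", ".", ".", "_", "2", "P", "_", "_"],
]
SAMPLE = "\n".join("\t".join(row) for row in SAMPLE_ROWS) + "\n"


def read_sentences(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts.

    A blank line ends the current sentence.
    """
    sentence: List[Dict[str, str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(COLUMNS, line.split("\t"))))
    if sentence:
        yield sentence


def coarse_tag(fine_tag: str) -> str:
    """Return the first two characters of the fine grained tag."""
    return fine_tag[:2]


if __name__ == "__main__":
    for sent in read_sentences(SAMPLE.splitlines()):
        for tok in sent:
            print(tok["form"], tok["postag"], coarse_tag(tok["postag"]),
                  tok["head"], tok["deprel"])
```

To read a real data file, pass an open file handle to read_sentences
instead of SAMPLE.splitlines().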

3. Unlabeled data (for domain adaptation track)

This data is in the folder unlab. It can consist of multiple files.
If there are multiple files, each file is numbered. Files with smaller
numbers contain less data, but that data is more relevant to the test
set. This may be because the tokenization is gold-standard (and not
automatic), or because the data comes from documents more closely
related to the test data.
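
A minimal sketch of how the numbered files might be visited from most
to least relevant is given below. This README does not specify the file
naming scheme, so the sketch assumes that each file name contains its
number as a run of digits. The helper names and the example file names
are hypothetical.

```python
import re
from pathlib import Path
from typing import Iterable, List


def file_number(path: Path) -> int:
    """Return the first integer in the file name, or -1 if there is none.

    Assumption: the file's number appears as digits in its name.
    """
    match = re.search(r"\d+", path.name)
    return int(match.group()) if match else -1


def order_unlabeled(paths: Iterable[Path]) -> List[Path]:
    """Sort files so that smaller numbers (more relevant data) come first."""
    return sorted(paths, key=file_number)


if __name__ == "__main__":
    # "unlab" is the folder named in this README; it must exist locally.
    for path in order_unlabeled(Path("unlab").iterdir()):
        print(path)
```

Sorting on the parsed integer rather than on the name itself keeps a
file numbered 10 after a file numbered 2.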

4. Acknowledgements

Thanks to the LDC and the University of Pennsylvania for providing the
data. In particular, thanks to the BioIE project for creating the data
set. Thanks to Richard Johansson for providing the software to
convert from phrase structure to dependencies.