This is the README file for the English CHILDES data in CoNLL 2007.

Version: README, v 0.1 2007/01/17

1. Source

CHILDES database of parent-child dialogues.
See: http://childes.psy.cmu.edu/

2. CoNLL specifics

- For the shared task, we sampled sentences until 5000 tokens had been
  extracted. Sentences of fewer than 4 tokens were discarded.

- The data format adheres to the standard format for all languages.
  See http://depparse.uvt.nl/depparse-wiki/DataFormat

- The data contains each word, a coarse-grained part-of-speech, a
  fine-grained part-of-speech, the head of each token and the
  dependency relation.

- The fine-grained part-of-speech is assigned using a maximum entropy
  part-of-speech tagger.

- The coarse-grained part-of-speech is just the first two characters of
  the fine-grained part-of-speech.

- CHILDES is annotated with grammatical relations (aka dependencies),
  so no conversion to dependencies was required.
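
The points above can be sketched in a few lines of Python. This is an
illustration only, not the script used to build the data. It assumes
the ten tab-separated columns of the standard shared-task format (ID,
FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL).
The names Token, parse_sentence, coarse_tag and sample_sentences are
hypothetical, the example sentence with its tags and relation labels is
invented, and the random sampling order is a guess, since the sampling
procedure is not specified beyond the token budget and length cutoff.

```python
import random
from typing import List, NamedTuple


class Token(NamedTuple):
    idx: int      # ID column (1-based position in the sentence)
    form: str     # FORM column (the word)
    cpos: str     # CPOSTAG column (coarse-grained part-of-speech)
    pos: str      # POSTAG column (fine-grained part-of-speech)
    head: int     # HEAD column (0 marks the root)
    deprel: str   # DEPREL column (dependency relation)


def parse_sentence(block: str) -> List[Token]:
    """Parse one sentence: one token per line, ten tab-separated columns."""
    tokens = []
    for line in block.splitlines():
        if not line.strip():
            continue  # skip blank separator lines
        cols = line.split("\t")
        tokens.append(Token(int(cols[0]), cols[1], cols[3], cols[4],
                            int(cols[6]), cols[7]))
    return tokens


def coarse_tag(fine: str) -> str:
    """Coarse tag = first two characters of the fine tag."""
    return fine[:2]


def sample_sentences(sentences, token_budget=5000, min_len=4, seed=0):
    """Drop short sentences, then draw sentences until the budget is met.

    Assumption: sentences are drawn in random order and the last
    sentence may push the total slightly over the budget.
    """
    rng = random.Random(seed)
    pool = [s for s in sentences if len(s) >= min_len]
    rng.shuffle(pool)
    sample, total = [], 0
    for sent in pool:
        if total >= token_budget:
            break
        sample.append(sent)
        total += len(sent)
    return sample


# Invented example sentence; tags and labels are placeholders,
# not taken from the CHILDES data.
EXAMPLE = "\n".join([
    "1\tthe\t_\tde\tdet\t_\t2\tDET\t_\t_",
    "2\tdog\t_\tn\tn\t_\t3\tSUBJ\t_\t_",
    "3\tbarks\t_\tv\tv\t_\t0\tROOT\t_\t_",
    "4\tloudly\t_\tad\tadv\t_\t3\tMOD\t_\t_",
])

if __name__ == "__main__":
    for tok in parse_sentence(EXAMPLE):
        # The stored coarse tag should match the two-character prefix.
        print(tok.form, tok.pos, tok.cpos, coarse_tag(tok.pos) == tok.cpos)
```

Fine tags shorter than two characters are returned unchanged by
coarse_tag, which is how the sketch handles a one-letter tag.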

3. Unlabeled data (for domain adaptation track)

This data is in the folder unlab. It was extracted from the English-USA
corpus at http://childes.psy.cmu.edu/data/local.html

We provide only the words (to be consistent with the other data sets);
the original data, however, also contains the gold standard morphology.

4. Acknowledgements

Thanks to Brian MacWhinney, Alon Lavie, Kenji Sagae, Shuly Wintner,
and the entire CHILDES project for creating this data and agreeing
to include it in the shared task.