This is the readme file for the English CHILDES data in CoNLL 2007

Version: README, v 0.1 2007/01/17

1. Source

   CHILDES database of parent-child dialogues.
   See: http://childes.psy.cmu.edu/

2. CoNLL specifics

   - For the shared task, we sampled sentences until 5000 tokens had been
     extracted. Sentences of fewer than 4 tokens were discarded.
   - The data format adheres to the standard format used for all languages.
     See http://depparse.uvt.nl/depparse-wiki/DataFormat
   - The data contains each word, a coarse-grained part-of-speech, a
     fine-grained part-of-speech, the head of each token, and the
     dependency relation.
   - The fine-grained part-of-speech is assigned using a maximum entropy
     part-of-speech tagger.
   - The coarse-grained part-of-speech is just the first two characters of
     the fine-grained part-of-speech.
   - CHILDES is annotated with grammatical relations (aka dependencies), so
     no conversion to dependencies was required.

3. Unlabeled data (for domain adaptation track)

   This data is in the folder unlab. It was extracted from the English-USA
   corpus at http://childes.psy.cmu.edu/data/local.html

   We provide only the words (to be consistent with the other data sets);
   however, the original data also contains the gold-standard morphology.

4. Acknowledgements

   Thanks to Brian MacWhinney, Alon Lavie, Kenji Sagae, Shuly Wintner, and
   the entire CHILDES project for creating this data and agreeing to
   include it in the shared task.
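
Example: reading the data (illustrative sketch)

The fields listed under "CoNLL specifics" can be read with a few lines of Python. This is a minimal sketch, not the tooling used to build the release. It assumes the ten tab-separated columns of the shared-task format (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with blank lines between sentences. The sample sentences, tags, and relation labels below are invented for illustration and are not taken from the corpus; the names Token, parse_conll, coarse_tag, and keep_sentence are hypothetical helpers.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    form: str    # the word
    cpos: str    # coarse-grained part-of-speech
    pos: str     # fine-grained part-of-speech
    head: int    # index of the head token (0 = root)
    deprel: str  # dependency relation to the head


def parse_conll(text: str) -> List[List[Token]]:
    """Split CoNLL-style text into sentences of Token tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        current.append(Token(cols[1], cols[3], cols[4], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


def coarse_tag(fine: str) -> str:
    """Coarse tag = first two characters of the fine tag (per this README)."""
    return fine[:2]


def keep_sentence(sentence: List[Token], min_tokens: int = 4) -> bool:
    """Mirror the sampling rule: drop sentences shorter than min_tokens."""
    return len(sentence) >= min_tokens


# Invented sample data; tags and labels are placeholders, not corpus values.
ROWS = [
    ["1", "you", "_", "pr", "pro", "_", "2", "SUBJ", "_", "_"],
    ["2", "want", "_", "v", "v", "_", "0", "ROOT", "_", "_"],
    ["3", "the", "_", "de", "det", "_", "4", "DET", "_", "_"],
    ["4", "ball", "_", "n", "n", "_", "2", "OBJ", "_", "_"],
    [],
    ["1", "thank", "_", "v", "v", "_", "0", "ROOT", "_", "_"],
    ["2", "you", "_", "pr", "pro", "_", "1", "OBJ", "_", "_"],
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS) + "\n"

if __name__ == "__main__":
    for sent in parse_conll(SAMPLE):
        status = "kept" if keep_sentence(sent) else "discarded"
        print(status, " ".join(tok.form for tok in sent))
```

Reading from a file instead of SAMPLE only requires passing the file's contents to parse_conll. The length filter is shown only to illustrate the sampling rule; the released data has already had short sentences removed.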