This is the readme file for the Japanese part of the CoNLL-X Shared Task.

Version: $Id: README,v 1.3 2006/01/09 13:25:56 erwin Exp $


1. Preamble

1.1 Source

The data for the Japanese part of the CoNLL-X Shared Task was derived
from the Verbmobil Treebank for Japanese.

1.2 Copyright

The copyright of the Verbmobil Treebank for Japanese belongs to
Eberhard-Karls-Universitaet Tuebingen, Seminar fuer Sprachwissenschaft,
Abt. Computerlinguistik.

1.3 License

This data is made available for the duration of the CoNLL-X Shared Task
under the license in the file license.txt.


2. Documentation

2.1 Data format

Data adheres to the following rules:

* Data files contain one or more sentences separated by a blank line.

* A sentence consists of one or more tokens, each one starting on a
  new line.

* A token consists of ten fields described in the table below. Fields
  are separated by a single tab character.

* All data files contain these ten fields, although only the ID, FORM,
  CPOSTAG, POSTAG, HEAD and DEPREL columns are guaranteed to contain
  non-underscore values for all languages.

* Data files are UTF-8 encoded (Unicode).

Field 1: ID
  Token counter, starting at 1 for each new sentence.

Field 2: FORM
  Word form or punctuation symbol.

Field 3: LEMMA
  Stem of word form. Not available for Japanese, so this field always
  contains an underscore.

Field 4: CPOSTAG
  Coarse-grained part-of-speech tag. The reduction from fine-grained to
  coarse-grained POS tags is defined in the file finecoarse.table, which
  also describes the tags. A full description of the tagset can be found
  in Chapter 4 of report-240-00.ps.

Field 5: POSTAG
  Fine-grained part-of-speech tag, as in the original treebank. For more
  information, see Chapter 4 in the file report-240-00.ps.

Field 6: FEATS
  List of additional morphological features.
------------------------------------------------------------
Values:   Description:
------------------------------------------------------------
eN        VAUXfin/VSfin/Vfin {e.g. -maseN}
kute      ADJi/VADJi/PADJ -kute {e.g. aka-kute, waru-kute}
ta        ADJi/VADJi/Vfin/PVfin/VAUXfin/VSfin/ -d/ta
          {e.g. aka-kat-ta, tabe-ta, deshita} (perfect)
u         V/PV/VAUX/VS-fin -u {e.g. iku, taberu, desu, deshou}
-         None
------------------------------------------------------------

Field 7: HEAD
  Non-projective head of current token, which is either a value of ID
  or zero ('0').

Field 8: DEPREL
  Dependency relation to the non-projective head, which is 'ROOT' when
  the value of HEAD is zero.

------------------------
Deprel:   Description:
------------------------
ADJ       Adjunct
COMP      Complement
HD        Co-head
MRK       Marker
PUNCT     Punctuation
SBJ       Subject
-         Unspecified
------------------------

  The HD relation holds for words which have edge label 'HD' in the
  original phrase structure tree, but where another daughter (marked as
  'HD' as well) was chosen to be the head of the dependency structure.

  For more information on the dependency relations, see Chapter 6 in
  the file report-240-00.ps.

Field 9: PHEAD
  Projective head of current token, which is identical to HEAD as the
  original treebank is already projective.

Field 10: PDEPREL
  Dependency relation to projective head, which is identical to DEPREL
  as the original treebank is already projective.

2.2 Text

The text material consists of transcriptions of dialogues in which two
discourse participants negotiate business appointments. The text is
transcribed in Romaji, i.e. using Latin letters. No transcription in
Japanese characters is available.
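Example: reading the data format

The rules in section 2.1 can be sketched as a small Python reader. This
is only an illustration and not part of the shared task tooling. The
function name read_sentences and the sample tokens (including the tags
'N', 'ITJ' and '.') are made up for the example and are not taken from
the treebank or from finecoarse.table.

```python
import io
from typing import Dict, Iterator, List, TextIO

# The ten field names, in the column order given in section 2.1.
FIELDS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
          "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]


def read_sentences(stream: TextIO) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of tokens (field name -> string value)."""
    sentence: List[Dict[str, str]] = []
    for line in stream:
        line = line.rstrip("\r\n")
        if not line.strip():
            # A blank line separates sentences.
            if sentence:
                yield sentence
                sentence = []
            continue
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
        sentence.append(dict(zip(FIELDS, values)))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Hypothetical sample data, for illustration only.
SAMPLE = "\n".join([
    "\t".join(["1", "ashita", "_", "N", "N", "_", "2", "ADJ", "2", "ADJ"]),
    "\t".join(["2", "iku", "_", "V", "Vfin", "u", "0", "ROOT", "0", "ROOT"]),
    "\t".join(["3", ".", "_", ".", ".", "_", "2", "PUNCT", "2", "PUNCT"]),
    "",
    "\t".join(["1", "hai", "_", "ITJ", "ITJ", "_", "0", "ROOT", "0", "ROOT"]),
    "",
])

sentences = list(read_sentences(io.StringIO(SAMPLE)))
```

To read an actual data file, pass a file object opened with
encoding="utf-8" instead of the io.StringIO wrapper. All values are
kept as strings; HEAD and ID can be converted with int() where needed.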
2.3 Statistics

-------------------------------
#sentences              17753
#tokens                157172
#non-punct tokens      138932
#non-punct types        36329
#coarse pos tags           23
#fine pos tags             81
#deprels                    9
-------------------------------

2.4 Conversion

In general, the head was determined by looking at the constituent
structure, and for each phrase taking the daughter with edge label
'HD'. In case of no head, the right-most child was chosen. In case of
multiple heads, the right-most head was taken.

The conversion of fine-grained to coarse-grained POS tags was
accomplished basically by stripping the final, lower-case characters
from the POS tag, retaining the initial, upper-case characters.

Punctuation, which was originally attached to the ROOT (i.e. HEAD=0),
was reattached to the directly preceding token. Commas only appeared in
one particular corpus segment, i.e. in cd32.export.


3. Acknowledgements

Yasuhiro Kawata, Julia Bartels and colleagues from Tuebingen University
for construction of the original Verbmobil treebank for Japanese.

Sandra Kuebler for granting the special license for CoNLL-X and
providing the data.
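Example: conversion rules

The head selection and tag reduction rules from section 2.4 can be
sketched in Python as follows. This is a rough illustration under
stated assumptions, not the script that was used for the conversion.
The function names choose_head and coarse_tag are hypothetical, and a
phrase is assumed to be given simply as the list of its daughters' edge
labels in left-to-right order. The file finecoarse.table remains the
authoritative definition of the tag mapping.

```python
import re
from typing import List


def choose_head(edge_labels: List[str]) -> int:
    """Return the index of the head daughter of a phrase.

    The right-most daughter labelled 'HD' is taken; if no daughter is
    labelled 'HD', the right-most child is taken.
    """
    if not edge_labels:
        raise ValueError("a phrase needs at least one daughter")
    for index in range(len(edge_labels) - 1, -1, -1):
        if edge_labels[index] == "HD":
            return index
    return len(edge_labels) - 1


def coarse_tag(fine_tag: str) -> str:
    """Strip the final lower-case characters from a fine-grained tag.

    Assumption: a tag that would become empty is returned unchanged.
    """
    stripped = re.sub(r"[a-z]+$", "", fine_tag)
    return stripped or fine_tag
```

For instance, coarse_tag("VAUXfin") gives "VAUX", and
choose_head(["HD", "HD", "ADJ"]) gives 1, the right-most of the two
heads.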