This is the readme file for the Japanese part of the CoNLL-X Shared Task.

Version: $Id: README,v 1.3 2006/01/09 13:25:56 erwin Exp $


1. Preamble

    1.1 Source

        The data for the Japanese part of the CoNLL-X Shared Task was
        derived from the Verbmobil Treebank for Japanese.

    1.2 Copyright

	The copyright of the Verbmobil Treebank for Japanese belongs
	to Eberhard-Karls-Universitaet Tuebingen, Seminar fuer
	Sprachwissenschaft, Abt. Computerlinguistik.

    1.3 License

    	This data is made available for the duration of the CoNLL-X
	Shared Task under the license in the file license.txt.

2. Documentation

    2.1 Data format

    	Data adheres to the following rules:

    	* Data files contain one or more sentences separated by a
	  blank line.

    	* A sentence consists of one or more tokens, each one starting on a
	  new line.

    	* A token consists of ten fields described in the table
	  below. Fields are separated by a single tab character. 

	* All data files contain these ten fields, although only
          the ID, FORM, CPOSTAG, POSTAG, HEAD and DEPREL columns are
          guaranteed to contain non-underscore values for all
          languages.

	* Data files are UTF-8 encoded (Unicode).
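
	The rules above can be sketched as a small Python reader. This
	is only an illustration, not part of the shared task tools: the
	names FIELDS and read_sentences and the choice of one dictionary
	per token are assumptions made for this example. The field
	names are those described below.

```python
# The ten field names, in column order.
FIELDS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
          "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]


def read_sentences(lines):
    """Yield each sentence as a list of token dicts keyed by field name."""
    sentence = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(
                "expected %d tab-separated fields, got %d"
                % (len(FIELDS), len(values)))
        sentence.append(dict(zip(FIELDS, values)))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence
```

	Since the data files are UTF-8 encoded, a file would be opened
	with open(path, encoding="utf-8") and the file object passed to
	read_sentences. All values, including ID and HEAD, are kept as
	strings in this sketch.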


	Field 1: ID 	

	    Token counter, starting at 1 for each new sentence.

	Field 2: FORM

	    Word form or punctuation symbol

	Field 3: LEMMA 	

	    Stem of word form. Not available for Japanese, so this
	    field always contains an underscore.

	Field 4: CPOSTAG 

	    Coarse-grained part-of-speech tag. The reduction from
	    fine-grained to coarse-grained POS tags is defined in the
	    file finecoarse.table, which also describes the tags. A
	    full description of the tagset can be found in Chapter 4
	    of report-240-00.ps

	Field 5: POSTAG 	

	    Fine-grained part-of-speech tag, as in the original
	    treebank. For more information, see Chapter 4 in the
	    file report-240-00.ps

	Field 6: FEATS 	

	    List of additional morphological features.

	    ------------------------------------------------------------
	    Values:	Description:
	    ------------------------------------------------------------
	    eN		VAUXfin/VSfin/Vfin
			{e.g. -maseN}
	    kute	ADJi/VADJi/PADJ  -kute
			{e.g. aka-kute, waru-kute}
	    ta		ADJi/VADJi/Vfin/PVfin/VAUXfin/VSfin/ -d/ta
			{e.g. aka-kat-ta, tabe-ta, deshita} (perfect)
	    u		V/PV/VAUX/VS-fin -u
			{e.g. iku, taberu, desu, deshou}
	    -		None
	    ------------------------------------------------------------
	
	Field 7: HEAD

	    Non-projective head of current token, which is either a
	    value of ID or zero ('0').
	
	Field 8: DEPREL 	

	    Dependency relation to the non-projective-head, which is
	    'ROOT' when the value of HEAD is zero.

	    ------------------------
	    Deprel:	Description:
	    ------------------------
	    ADJ	   	Adjunct
	    COMP   	Complement
	    HD		Co-head
	    MRK		Marker
	    PUNCT	Punctuation
	    SBJ		Subject
	    -		Unspecified
	    ------------------------

	    The HD relation holds for words which have edge label 'HD'
	    in the original phrase structure tree, but where another
	    daughter (marked as 'HD' as well) was chosen to be the
	    head of the dependency structure.

	    For more information on the dependency relations, see
	    Chapter 6 in the file report-240-00.ps

	Field 9: PHEAD 	

	    Projective head of current token, which is identical to
	    HEAD as the original treebank is already projective.

	Field 10: PDEPREL 

	    Dependency relation to projective head, which is identical to
	    DEPREL as the original treebank is already projective.
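
	    To illustrate what "projective" means here, the sketch
	    below tests a sentence for crossing dependency arcs. It
	    assumes one common definition of projectivity (no two arcs
	    cross when the artificial root is placed at position 0);
	    the function name is_projective and the list-of-integers
	    input are assumptions made for this example.

```python
def is_projective(heads):
    """Return True if no two dependency arcs cross.

    heads[i] is the HEAD value (as an int) of the token with ID i + 1;
    0 denotes the artificial root, placed before the first token.
    """
    # Represent each arc by its left and right end positions.
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for left1, right1 in arcs:
        for left2, right2 in arcs:
            # The arcs cross if the second one starts strictly inside
            # the first and ends strictly outside it.
            if left1 < left2 < right1 < right2:
                return False
    return True
```

	    Under this definition, every sentence in the data would be
	    expected to pass the test. The quadratic loop is kept for
	    readability and is adequate for short sentences.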


    2.2 Text

	The text material consists of transcriptions of dialogues
	in which two discourse participants negotiate business
	appointments.

	The text is transcribed in Romaji, i.e. using Latin letters.
	No transcription in Japanese characters is available.


    2.3 Statistics

	-------------------------------
	#sentences		 17753
	#tokens			157172
	#non-punct tokens	138932
	#non-punct types	 36329 

	#coarse pos tags	    23
	#fine pos tags		    81
	#deprels		     9
	-------------------------------	


    2.4 Conversion

	In general, the head was determined by looking at the
	constituent structure, and for each phrase taking the daughter
	with edge label 'HD'.  In case of no head, the right-most
	child was chosen. In case of multiple heads, the right-most
	head was taken.
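
	The head selection rule can be sketched as follows. This is an
	illustration of the rule as stated, not the conversion script
	itself; the function name choose_head and the representation
	of a phrase as a left-to-right list of edge labels are
	assumptions made for this example.

```python
def choose_head(edge_labels):
    """Return the index of the head daughter of a phrase.

    edge_labels lists the edge label of each daughter, left to right.
    """
    if not edge_labels:
        raise ValueError("a phrase needs at least one daughter")
    hd_positions = [i for i, label in enumerate(edge_labels)
                    if label == "HD"]
    if hd_positions:
        # One head: take it. Several heads: take the right-most one.
        return hd_positions[-1]
    # No daughter is labelled 'HD': fall back to the right-most child.
    return len(edge_labels) - 1
```

	For a phrase with two 'HD' daughters followed by a non-head
	daughter, this returns index 1, the right-most of the two
	heads.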

	The conversion of fine-grained to coarse-grained POS tags was
	accomplished essentially by stripping the final, lower-case
	characters from the POS tag, retaining the initial, upper-case
	characters.
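
	A minimal sketch of this reduction in Python is given below.
	The function name coarse_tag is an assumption, as is the
	fallback that keeps a tag unchanged when it consists only of
	lower-case letters. Because the rule is only approximate, the
	file finecoarse.table remains the authoritative mapping.

```python
import re


def coarse_tag(fine_tag):
    """Strip trailing lower-case ASCII letters from a fine-grained tag."""
    stripped = re.sub(r"[a-z]+$", "", fine_tag)
    # Assumption: keep the tag unchanged if stripping leaves nothing.
    return stripped or fine_tag
```

	Under this rule, a fine-grained tag such as VAUXfin reduces to
	VAUX and ADJi reduces to ADJ, while a tag without lower-case
	characters, such as PADJ, is left as it is.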

    	Punctuation, which was originally attached to the ROOT
    	(i.e. HEAD=0), was reattached to the directly preceding token.
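
	The reattachment step can be illustrated with the following
	sketch. It is not the original conversion code: the function
	name reattach_punctuation and its two parallel input lists are
	assumptions, and so is the treatment of sentence-initial
	punctuation, which has no preceding token and is left attached
	to the root here.

```python
def reattach_punctuation(heads, is_punct):
    """Return a copy of heads with root-attached punctuation reattached.

    heads[i] is the HEAD value (as an int) of the token with ID i + 1.
    is_punct[i] is True if that token is a punctuation symbol.
    """
    new_heads = list(heads)
    for i, (head, punct) in enumerate(zip(heads, is_punct)):
        token_id = i + 1
        # Assumption: sentence-initial punctuation has no preceding
        # token, so this sketch leaves it attached to the root.
        if punct and head == 0 and token_id > 1:
            new_heads[i] = token_id - 1
    return new_heads
```

	How punctuation tokens are identified is left to the caller,
	since it depends on the tagset.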

	Commas only appeared in one particular corpus segment, i.e. in
	cd32.export.


3. Acknowledgements

	Yasuhiro Kawata, Julia Bartels and colleagues from Tuebingen 
	University for construction of the original Verbmobil treebank
	for Japanese.

	Sandra Kuebler for granting the special license for CoNLL-X
	and providing the data.