Greek Dependency Treebank Documentation for the CoNLL 2007 Shared Task

Table of Contents
1. Preamble
2. Documentation
3. Acknowledgments
A. 2007 CoNLL Evaluation Agreement
B. Features

2010 Update Note: The dataset contained in this distribution is the Greek Dependency Treebank as it was converted and released for the purposes of the CoNLL 2007 Shared Task. After the addition of new annotated material in 2009-2010, GDT currently contains approximately 100K tokens with texts from Greek Wikipedia articles, manually normalized transcripts of European parliamentary sessions, and web documents pertaining to the politics, health, and travel domains. For updates, please visit the GDT site, http://gdt.ilsp.gr. The following documentation corresponds to the 2007 edition.

1. Preamble

1.1. Source

The Greek Dependency Treebank (GDT) will become available from the CoNLL 2007 shared task organizers, for the purposes of the CoNLL 2007 shared task only.

1.2. Copyright

The Greek Dependency Treebank is copyrighted material.

The GDT corpus was collected by ILSP researchers in the framework of national and EU-funded research projects aiming at multilingual, multimedia information extraction. GDT consists of randomly selected textual fragments and texts in three domains: politics (current affairs, manual transcripts and minutes of European parliamentary sessions), health, and travel. The copyright to these textual data belongs to the authors of the original texts. The dependency annotations in the GDT were carried out by students of the postgraduate programme Technoglossia IV, organised by the Institute for Language and Speech Processing, the University of Athens and the National Technical University of Athens, following a postgraduate course in Corpus-based Linguistics taught by Haris Papageorgiou in 2005 with the assistance of Prokopis Prokopidis, Elina Desypri and Maria Koutsombogera. The initial dependency annotations were validated by the ILSP researchers Prokopis Prokopidis, Elina Desypri and Maria Koutsombogera.

1.3. License

The copyright owner of the Greek Dependency Treebank (Institute for Language and Speech Processing) grants the CoNLL 2007 shared task organizers and participants the right to use the Greek Dependency Treebank under the terms of the license agreement in Appendix A.

2. Documentation

2.1. Reference

Prokopis Prokopidis, Elina Desypri, Maria Koutsombogera, Haris Papageorgiou, and Stelios Piperidis. Theoretical and Practical Issues in the Construction of a Greek Dependency Treebank. In Montserrat Civit, Sandra Kübler, and Ma. Antònia Martí, editors, Proceedings of The Fourth Workshop on Treebanks and Linguistic Theories (TLT 2005), pages 149-160, Barcelona, Spain, December 2005. Universitat de Barcelona.

2.2. Data format

Data for the CoNLL 2007 shared task adheres to the following rules:

Data files contain one or more sentences separated by a blank line.
A sentence consists of one or more tokens, each one starting on a new line.
A token consists of ten fields described in Table 1. Fields are separated by one tab character.
All data files will contain these ten fields, although only the ID, FORM, CPOSTAG, POSTAG, HEAD and DEPREL columns are guaranteed to contain non-underscore values for all languages.
Data files are UTF-8 encoded.
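
The rules above can be sketched as a small reader. The following Python code is a minimal illustration and not part of the GDT distribution; the names FIELDS, read_conll and SAMPLE are hypothetical, and the sample sentence is made up rather than taken from the treebank.

```python
import io

# Field names in column order, following the ten-field token layout.
FIELDS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
          "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]


def read_conll(stream):
    """Yield sentences as lists of dicts, one dict per token."""
    sentence = []
    for line in stream:
        line = line.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(
                f"expected {len(FIELDS)} tab-separated fields, got {len(values)}")
        sentence.append(dict(zip(FIELDS, values)))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Made-up two-token sentence, for illustration only (not GDT data).
SAMPLE = ("1\tA\ta\tNo\tNoCm\t_\t0\tPred\t_\t_\n"
          "2\t.\t.\tPUNCT\tPUNCT\t_\t1\tAuxK\t_\t_\n"
          "\n")
sentences = list(read_conll(io.StringIO(SAMPLE)))
```

For a real data file, the stream would typically be opened with open(path, encoding="utf-8"), since the data files are UTF-8 encoded. All values are kept as strings; converting ID and HEAD to integers is left to the caller.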

Table 1. Description of the ten token fields

	Field:	Description:
1	ID	Token counter, starting at 1 for each new sentence
2	FORM	Word form or punctuation symbol
3	LEMMA	Lemma of word form. In the case of GDT, lemmas were automatically retrieved from ILSP's Greek morphological lexicon, but they were not manually corrected.
4	CPOSTAG	Coarse-grained part-of-speech tag. In the case of GDT, CPOS tags were automatically assigned by a tagger developed at ILSP, but they were not manually corrected. For a description of all possible CPOS tags, see Table 2.
5	POSTAG	Fine-grained part-of-speech tag. In the case of GDT, POS tags were automatically assigned by a tagger developed at ILSP, but they were not manually corrected. For a description of all possible fine-grained POS tags, see Table 3.
6	FEATS	List of set-valued syntactic and/or morphological features. In the case of GDT, these features were automatically assigned by a tagger developed at ILSP, but they were not manually corrected. For a description of all possible features, see Appendix B.
7	HEAD	Non-projective head of current token, which is either a value of ID or zero ('0').
8	DEPREL	Dependency relation to the non-projective head. The GDT annotation schema is not currently available in English; for a short description of all possible dependency relations, see Table 4.
9	PHEAD	Projective head of current token. Always an underscore, since this information is not available in the GDT.
10	PDEPREL	Dependency relation to projective head. Always an underscore, since this information is not available in the GDT.
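
Since HEAD is either zero or the ID of another token in the same sentence, the HEAD column alone determines the tree. The following Python code is a minimal sketch of how the dependents of each token might be collected; the function name dependents_by_head and the example HEAD values are hypothetical. It assumes the usual CoNLL convention that a HEAD of zero attaches the token to an artificial root.

```python
from collections import defaultdict


def dependents_by_head(heads):
    """Map each head ID to the IDs of its dependents.

    heads[i] is the HEAD value (as an int) of the token whose ID is i + 1.
    Key 0 collects the tokens whose HEAD is zero.
    """
    n = len(heads)
    tree = defaultdict(list)
    for token_id, head in enumerate(heads, start=1):
        if not 0 <= head <= n:
            raise ValueError(
                f"token {token_id}: HEAD {head} is neither 0 nor a valid ID")
        tree[head].append(token_id)
    return dict(tree)


# Hypothetical HEAD column of a four-token sentence.
tree = dependents_by_head([2, 0, 2, 2])
```

In this example, token 2 has HEAD zero and tokens 1, 3 and 4 depend on token 2. The sketch only checks that each HEAD is in range; it does not check for cycles.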

Table 2. Coarse-grained POS tags

Ad	Adverb
Aj	Adjective
AsPp	Preposition
At	Article
Cj	Conjunction
COMP	A composite word form
DATE	Date
DIG	Digit
ENUM	Enumeration element
INIT	Initial
LSPLIT	A word with a letter or syllable omitted at the end of the wordform (apocope)
Nm	Numeral
No	Noun
Pn	Pronoun
Pt	Particle
PUNCT	Punctuation symbol
Rg	Residual
Vb	Verb

Table 3. Fine-grained POS tags

Ad	Adverb
Aj	Adjective
AsPpPa	Preposition + Article combination
AsPpSp	Simple preposition
AtDf	Definite article
AtId	Indefinite article
CjCo	Coordinating conjunction
CjSb	Subordinating conjunction
COMP	A composite word form
DATE	Date
DIG	Digit
ENUM	Enumeration element
INIT	Initial
LSPLIT	A word with a letter or syllable omitted at the end of the wordform (apocope)
NmCd	Cardinal numeral
NmCt	Collective numeral
NmMl	Multiplicative numeral
NmOd	Ordinal numeral
NoCm	Common noun
NoPr	Proper noun
PnDm	Demonstrative pronoun
PnId	Indefinite pronoun
PnIr	Interrogative pronoun
PnPe	Personal pronoun
PnPo	Possessive pronoun
PnRe	Relative pronoun
PnRi	Relative indefinite pronoun
PtFu	Future particle
PtNe	Negative particle
PtOt	Other particle
PtSj	Subjunctive particle
PUNCT	Punctuation symbol
RgAbXx	Abbreviation
RgAnXx	Acronym
RgFwOr	Foreign word in its original form
RgFwTr	Transliterated foreign word
VbIs	Impersonal verb
VbMn	Main verb

Table 4. Dependency Relations

Afun	Description
Pred	Main sentence predicate
Sb	Subject
Obj	Direct object
IObj	Indirect object
Pnom	Predicative dependent
Adv	Adverbial dependent
Atv	Adverbial predicative dependent
Atr	Attribute
AuxP	Prepositional node
AuxC	Conjunction node
Coord	A node governing coordination
Apos	A node governing apposition
*_Co	A node governed by a Coord
*_Ap	A node governed by an Apos
*_Pa	Head node of a parenthetical structure
AuxX	Comma
AuxV	Auxiliary node attached to a verb
AuxK	Terminal punctuation
AuxG	Auxiliary punctuation
ExD	A node whose real parent node is not present in the sentence (ellipsis)
AuxY	Other, auxiliary sentence elements
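
The *_Co, *_Ap and *_Pa rows of Table 4 describe suffixes attached to another relation rather than relations of their own. The following Python code is a minimal sketch of how such a label might be separated into its base relation and suffix; the names SUFFIXES and split_deprel are hypothetical. It assumes the suffix follows the last underscore of the label, and it splits off only the final suffix, since this documentation does not state whether a label may carry more than one.

```python
# Suffixes taken from the *_Co, *_Ap and *_Pa rows of Table 4.
SUFFIXES = ("Co", "Ap", "Pa")


def split_deprel(deprel):
    """Split a DEPREL value into (base relation, suffix or None)."""
    base, sep, suffix = deprel.rpartition("_")
    if sep and base and suffix in SUFFIXES:
        return base, suffix
    # No recognised suffix: return the label unchanged.
    return deprel, None


with_suffix = split_deprel("Sb_Co")
without_suffix = split_deprel("Pred")
```

Here "Sb_Co" is split into the base relation "Sb" and the suffix "Co", while "Pred" is returned unchanged with no suffix.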

2.3. Text

The text material consists of texts, or extracts of texts, that were collected in the framework of national and EU-funded research projects aiming at multilingual, multimedia information extraction. The main domains covered are politics (current affairs and manual transcripts of European parliamentary sessions), health, and travel.

2.4. Statistics

Table 5. GDT Statistics

Sentences	2902
Tokens	70223
Tokens (non-punct)	63072
Types (non-punct)	13295
Lemmas	7017
Coarse POS Tags	18
Fine POS Tags	38
DepRels	45
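
Counts of this kind can be recomputed from the data files. The following Python code is a minimal sketch, assuming each sentence is a list of tokens and each token a dict keyed by the field names of Table 1; the function name treebank_stats and the toy input are hypothetical. It treats a token as punctuation when its CPOSTAG is PUNCT and counts types case-sensitively on FORM. Both choices are assumptions, since the counting conventions behind Table 5 are not documented here, so the results need not match the table exactly.

```python
def treebank_stats(sentences):
    """Count sentences, tokens, and non-punctuation tokens and types."""
    tokens = [tok for sent in sentences for tok in sent]
    # Assumption: punctuation is identified by the coarse tag PUNCT (Table 2).
    words = [tok for tok in tokens if tok["CPOSTAG"] != "PUNCT"]
    return {
        "sentences": len(sentences),
        "tokens": len(tokens),
        "tokens_non_punct": len(words),
        # Assumption: types are distinct word forms, compared case-sensitively.
        "types_non_punct": len({tok["FORM"] for tok in words}),
        "deprels": len({tok["DEPREL"] for tok in tokens}),
    }


# Made-up input for illustration only (not GDT data).
toy = [[
    {"FORM": "a", "CPOSTAG": "No", "DEPREL": "Sb"},
    {"FORM": "a", "CPOSTAG": "No", "DEPREL": "Obj"},
    {"FORM": ".", "CPOSTAG": "PUNCT", "DEPREL": "AuxK"},
]]
stats = treebank_stats(toy)
```

For the toy input, the sketch reports one sentence, three tokens, two non-punctuation tokens, one non-punctuation type and three distinct dependency relations.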

2.5. Conversion

The conversion of the original annotation files (stored in the PDT *fs format) into the CoNLL tab format was made by Prokopis Prokopidis. Sentences of all files were randomly shuffled to generate the final training and testing data delivered to the CoNLL 2007 shared task organizers and participants.

3. Acknowledgments

The General Directorate of the Technoglossia postgraduate programme of studies.

Research Project Multimedia Content Management Systems (MUSE), E-Business 84. General Secretariat for Research and Technology of the Greek Ministry of Development.

Research Project Retrieval of Video and Language for The Home user in an Information Society (Reveal This), FP6-IST-511689.

Students Antonopoulou Fotoula, Vlachou Argiro, Dimopoulou Maria, Drakopoulou Stella, Zourari Maria, Ilikevich Yuliya, Carayannidi Aphroditi, Carra Vasiliki, Kefalas Athanasios, Kioultzidi Lambrini, Lada Maria, Mamouzelou Anthi, Marzelou Euridiki, Mitrakou Harriet, Morfopoulou Vasiliki, Badavanou Sofia, Papagiannopoulou Aggeliki, Paschou Eustratia, Redoumi Vasiliki, Roditi Ioanna, Sakellaropoulou Theofani, Touribaba Aglaia, Tsagogeorga Dimitra, Tsarouchas Dimitrios, Theologou Maria, Antonopoulos Theodoros, Fakou Aikaterini, Nikta Marina, Gakis Dimitrios and Aggelou Epaminodas, for their ideas & annotation work during the course.

The Prague Dependency Treebank project for making available excellent open source tools for annotation and conversion of dependency trees; the annotation schema for the GDT was based on the original schema provided by the PDT.

Jens Nilsson for providing help in all issues concerning the shared task.

A. 2007 CoNLL Evaluation Agreement

---------------------------------------------------------------------
2007 CoNLL Evaluation Agreement

In the remainder of this document the term User refers to:

______________________________________ (Individual name)

and the term User's research group refers to:

_______________________________________ (University, Institute or Company name)

_______________________________________ (Specific department or area, if appropriate).

This letter describes the terms of an agreement between User and the
Institute for Language and Speech Processing (ILSP), in which User
will receive material as specified below.

Under this agreement, User will receive by email or ftp a copy of the
Greek Dependency Treebank (GDT) converted in a format suitable for the
2007 CoNLL shared task on dependency parsing. User agrees to use the
material received under this evaluation, and any resources derivative
from this material (e.g. parts of the GDT, statistical models based on
the GDT, modified versions of the GDT), only for the purposes of the
2007 CoNLL shared task. After participation has ended, User agrees to
delete the GDT copy from any computer or media onto which it has been
copied. User further agrees to delete any GDT derivatives that were
created during the 2007 CoNLL shared task. User further agrees not to
disclose, copy or redistribute the GDT or any of its derivatives to
others outside of the User's research group.

User agrees that the Institute for Language and Speech Processing does
not warrant the accuracy, completeness, currentness, merchantability
or fitness for a particular purpose of the information contained in
the GDT. In no event will the Institute for Language and Speech
Processing be liable to any authorized user, or anyone else for any
loss or injury caused in whole or in part by its negligence or
contingencies beyond its control in procuring, compiling,
interpreting, editing, writing, reporting or delivering the
information, or any errors, omissions or inaccuracies in the
information, regardless of how caused. In no event will the Institute
for Language and Speech Processing be liable to the organization, any
authorized user or anyone else for any decision made or action taken
by the organization or any authorized user in reliance upon any part
of the information or for any consequential, direct, special or
similar damages, even if advised of the possibility of such damages.

Corpora and/or Data Received:

CoNLL-2007 Shared Task Datasets (GDT)

Organization: ___________________________________________

Name: ___________________________________________________

Signature: ______________________________________________

Date: ___________________________________________________

E-mail (required): ______________________________________

For ILSP:

Stelios Piperidis
Head of the Department of Language Technology Applications
Institute for Language and Speech Processing
Artemidos 6 & Epidavrou
GR-151 25 Maroussi
Greece

B. Features

Table B-1. Ad (Adverb)

Degree
Ba (Basic)
Cp (Comparative)
Su (Superlative)

Table B-2. Aj (Adjective)

Degree	Gender	Number	Case
Ba (Basic)	Ma (Masculine)	Sg (Singular)	Nm (Nominative)
Cp (Comparative)	Fe (Feminine)	Pl (Plural)	Ge (Genitive)
Su (Superlative)	Ne (Neuter)		Ac (Accusative)
			Da (Dative)
			Vo (Vocative)

Table B-3. AsPpPa (Preposition + Article combination)

Gender	Number	Case
Ma (Masculine)	Sg (Singular)	Ac (Accusative)
Fe (Feminine)	Pl (Plural)	Ge (Genitive)
Ne (Neuter)

Table B-4. At (Article)

Gender	Number	Case
Ma (Masculine)	Sg (Singular)	Nm (Nominative)
Fe (Feminine)	Pl (Plural)	Ge (Genitive)
Ne (Neuter)		Ac (Accusative)
		Da (Dative)

Table B-5. Nm (Numeral)

Gender	Number	Case	Function
Ma (Masculine)	Sg (Singular)	Nm (Nominative)	Aj (Adjectival)
Fe (Feminine)	Pl (Plural)	Ge (Genitive)	No (Nominal)
Ne (Neuter)		Ac (Accusative)
		Da (Dative)
		Vo (Vocative)

Table B-6. No (Noun)

Gender	Number	Case
Ma (Masculine)	Sg (Singular)	Nm (Nominative)
Fe (Feminine)	Pl (Plural)	Ge (Genitive)
Ne (Neuter)		Ac (Accusative)
		Da (Dative)
		Vo (Vocative)

Table B-7. Pn (Pronoun)

Gender	Person	Number	Case	Inflection
Ma (Masculine)	01	Sg (Singular)	Nm (Nominative)	We (Weak)
Fe (Feminine)	02	Pl (Plural)	Ge (Genitive)	St (Strong)
Ne (Neuter)	03		Ac (Accusative)	Xx (No Value)
			Da (Dative)
			Vo (Vocative)

Table B-8. Vb (Verb)

Finiteness/Mood	Tense	Person	Number	Gender	Aspect	Voice	Case
Id (Indicative)	Pr (Present)	01	Sg (Singular)	Ma (Masculine)	Ip (Imperfective)	Av (Active)	Nm (Nominative)
Mp (Imperative)	Pa (Past)	02	Pl (Plural)	Fe (Feminine)	Pe (Perfective)	Pv (Passive)	Ge (Genitive)
Nf (Infinitive)	Xx (No Value)	03	Xx (No Value)	Ne (Neuter)			Ac (Accusative)
Pp (Participle)		Xx (No Value)		Xx (No Value)			Da (Dative)
							Vo (Vocative)
							Xx (No Value)