Greek Dependency Treebank Documentation for the CoNLL 2007 Shared Task


Table of Contents
1. Preamble
2. Documentation
3. Acknowledgments
A. 2007 CoNLL Evaluation Agreement
B. Features

2010 Update Note: The dataset contained in this distribution is the Greek Dependency Treebank as it was converted and released for the purposes of the CoNLL 2007 Shared Task. After the addition of new annotated material in 2009-2010, GDT currently contains approximately 100K tokens with texts from Greek Wikipedia articles, manually normalized transcripts of European parliamentary sessions, and web documents pertaining to the politics, health, and travel domains. For updates, please visit the GDT site, http://gdt.ilsp.gr. The following documentation corresponds to the 2007 edition.


1. Preamble

1.1. Source

The Greek Dependency Treebank (GDT) will become available from the CoNLL 2007 shared task organizers, for the purposes of the CoNLL 2007 shared task only.


1.2. Copyright

The Greek Dependency Treebank is copyrighted material.

* (c) 2005-2007, by the Institute for Language and Speech Processing. ILSP owns the copyright to all automatic and manually-validated annotations in the GDT.

The GDT corpus was collected by ILSP researchers in the framework of national and EU-funded research projects aiming at multilingual, multimedia information extraction. GDT consists of randomly selected textual fragments and texts in three domains: politics (current affairs, manual transcripts and minutes of European parliamentary sessions), health, and travel. The copyright to these textual data belongs to the authors of the original texts. The dependency annotations in the GDT were carried out by students of the postgraduate programme Technoglossia IV, organised by the Institute for Language and Speech Processing, the University of Athens and the National Technical University of Athens, following a postgraduate course in Corpus-based Linguistics taught by Haris Papageorgiou in 2005 with the assistance of Prokopis Prokopidis, Elina Desypri and Maria Koutsombogera. The initial dependency annotations were validated by the ILSP researchers Prokopis Prokopidis, Elina Desypri and Maria Koutsombogera.


1.3. License

The copyright owner of the Greek Dependency Treebank (Institute for Language and Speech Processing) grants the CoNLL 2007 shared task organizers and participants the right to use the Greek Dependency Treebank under the terms of the license agreement in Appendix A.


2. Documentation

2.1. Reference

Prokopis Prokopidis, Elina Desypri, Maria Koutsombogera, Haris Papageorgiou, and Stelios Piperidis. Theoretical and Practical Issues in the Construction of a Greek Dependency Treebank. In Montserrat Civit, Sandra Kübler, and Ma. Antònia Martí, editors, Proceedings of The Fourth Workshop on Treebanks and Linguistic Theories (TLT 2005), pages 149-160, Barcelona, Spain, December 2005. Universitat de Barcelona.


2.2. Data format

Data for the CoNLL 2007 shared task adheres to the following rules:

  • Data files contain one or more sentences separated by a blank line.

  • A sentence consists of one or more tokens, each one starting on a new line.

  • A token consists of ten fields described in Table 1. Fields are separated by one tab character.

  • All data files will contain these ten fields, although only the ID, FORM, CPOSTAG, POSTAG, HEAD and DEPREL columns are guaranteed to contain non-underscore values for all languages.

  • Data files are UTF-8 encoded.
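The rules above can be sketched as a small reader. This is a minimal illustration, not the tool used to produce the data: the function name read_sentences and the two-token sample are made up for the example, and the field names follow Table 1.

```python
from typing import Dict, Iterable, Iterator, List

# Field names in the order given in Table 1.
FIELDS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
          "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]

Token = Dict[str, str]


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield each sentence as a list of token dicts keyed by field name."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(
                f"expected {len(FIELDS)} fields, got {len(values)}: {line!r}")
        sentence.append(dict(zip(FIELDS, values)))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Made-up two-token example (not taken from the GDT); "_" marks an empty field.
SAMPLE = (
    "1\tw1\t_\tNo\tNoCm\t_\t2\tSb\t_\t_\n"
    "2\tw2\t_\tVb\tVbMn\t_\t0\tPred\t_\t_\n"
    "\n"
)
sentences = list(read_sentences(SAMPLE.splitlines()))
```

To read an actual data file, the same function can be given a file object opened with encoding="utf-8", since the data files are UTF-8 encoded.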

Table 1. Description of the ten token fields

No.  Field    Description
1    ID       Token counter, starting at 1 for each new sentence
2    FORM     Word form or punctuation symbol
3    LEMMA    Lemma of word form. In the case of GDT, lemmas were automatically retrieved from ILSP's Greek morphological lexicon, but they were not manually corrected.
4    CPOSTAG  Coarse-grained part-of-speech tag. In the case of GDT, CPOS tags were automatically assigned by a tagger developed at ILSP, but they were not manually corrected. For a description of all possible CPOS tags, see Table 2.
5    POSTAG   Fine-grained part-of-speech tag. In the case of GDT, POS tags were automatically assigned by a tagger developed at ILSP, but they were not manually corrected. For a description of all possible fine-grained POS tags, see Table 3.
6    FEATS    List of set-valued syntactic and/or morphological features. In the case of GDT, these features were automatically assigned by a tagger developed at ILSP, but they were not manually corrected. For a description of all possible features, see Appendix B.
7    HEAD     Non-projective head of current token, which is either a value of ID or zero ('0').
8    DEPREL   Dependency relation to the non-projective head. The GDT annotation schema is not currently available in English; for a short description of all possible dependency relations, see Table 4.
9    PHEAD    Projective head of current token. Always an underscore, since this information is not available in the GDT.
10   PDEPREL  Dependency relation to projective head. Always an underscore, since this information is not available in the GDT.
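Because HEAD holds non-projective heads and no projective alternative is provided, it can be useful to test whether a given sentence is projective. The following is a sketch of one common check, not part of the GDT tooling: an arc is treated as projective when every token strictly between the head and the dependent is dominated by the head. The function name is_projective is an assumption of this example.

```python
from typing import List


def is_projective(heads: List[int]) -> bool:
    """heads[i] is the HEAD of the token with ID i + 1; 0 marks the root."""
    n = len(heads)
    if any(h < 0 or h > n for h in heads):
        raise ValueError("HEAD must be 0 or the ID of a token in the sentence")

    def dominated_by(node: int, ancestor: int) -> bool:
        # Follow HEAD links upwards from node; the step limit guards
        # against cycles in malformed input.
        steps = 0
        while node != 0 and steps <= n:
            if node == ancestor:
                return True
            node = heads[node - 1]
            steps += 1
        return False

    for dep, head in enumerate(heads, start=1):
        if head == 0:
            continue
        lo, hi = sorted((dep, head))
        # Every token between head and dependent must descend from head.
        for between in range(lo + 1, hi):
            if not dominated_by(between, head):
                return False
    return True
```

For example, the HEAD sequence [2, 0, 2] describes a projective tree, while [3, 4, 0, 3] does not, because the arc from token 4 to token 2 crosses over the root token 3.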

Table 2. Coarse-grained POS tags

Ad      Adverb
Aj      Adjective
AsPp    Preposition
At      Article
Cj      Conjunction
COMP    A composite word form
DATE    Date
DIG     Digit
ENUM    Enumeration element
INIT    Initial
LSPLIT  A word with a letter or syllable omitted at the end of the wordform (apocope)
Nm      Numeral
No      Noun
Pn      Pronoun
Pt      Particle
PUNCT   Punctuation symbol
Rg      Residual
Vb      Verb

Table 3. Fine-grained POS tags

Ad      Adverb
Aj      Adjective
AsPpPa  Preposition + Article combination
AsPpSp  Simple preposition
AtDf    Definite article
AtId    Indefinite article
CjCo    Coordinating conjunction
CjSb    Subordinating conjunction
COMP    A composite word form
DATE    Date
DIG     Digit
ENUM    Enumeration element
INIT    Initial
LSPLIT  A word with a letter or syllable omitted at the end of the wordform (apocope)
NmCd    Cardinal numeral
NmCt    Collective numeral
NmMl    Multiplicative numeral
NmOd    Ordinal numeral
NoCm    Common noun
NoPr    Proper noun
PnDm    Demonstrative pronoun
PnId    Indefinite pronoun
PnIr    Interrogative pronoun
PnPe    Personal pronoun
PnPo    Possessive pronoun
PnRe    Relative pronoun
PnRi    Relative indefinite pronoun
PtFu    Future particle
PtNe    Negative particle
PtOt    Other particle
PtSj    Subjunctive particle
PUNCT   Punctuation symbol
RgAbXx  Abbreviation
RgAnXx  Acronym
RgFwOr  Foreign word in its original form
RgFwTr  Transliterated foreign word
VbIs    Impersonal verb
VbMn    Main verb
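In Tables 2 and 3, every fine-grained tag begins with one of the coarse-grained tags (for example, AsPpPa begins with AsPp). This is an observation from the tables rather than a rule stated by the documentation, so the sketch below, with the hypothetical helper coarse_tag, is best used as a consistency check between the CPOSTAG and POSTAG columns.

```python
# Coarse-grained tags copied from Table 2.
COARSE_TAGS = [
    "Ad", "Aj", "AsPp", "At", "Cj", "COMP", "DATE", "DIG", "ENUM",
    "INIT", "LSPLIT", "Nm", "No", "Pn", "Pt", "PUNCT", "Rg", "Vb",
]


def coarse_tag(postag: str) -> str:
    """Return the longest coarse-grained tag that is a prefix of postag."""
    matches = [tag for tag in COARSE_TAGS if postag.startswith(tag)]
    if not matches:
        raise ValueError(f"no coarse-grained tag matches {postag!r}")
    return max(matches, key=len)
```

A token whose CPOSTAG differs from coarse_tag(POSTAG) would then be worth inspecting by hand.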

Table 4. Dependency Relations

Afun    Description
Pred    Main sentence predicate
Sb      Subject
Obj     Direct object
IObj    Indirect object
Pnom    Predicative dependent
Adv     Adverbial dependent
Atv     Adverbial predicative dependent
Atr     Attribute
AuxP    Prepositional node
AuxC    Conjunction node
Coord   A node governing coordination
Apos    A node governing apposition
*_Co    A node governed by a Coord
*_Ap    A node governed by an Apos
*_Pa    Head node of a parenthetical structure
AuxX    Comma
AuxV    Auxiliary node attached to a verb
AuxK    Terminal punctuation
AuxG    Auxiliary punctuation
ExD     A node whose real parent node is not present in the sentence (ellipsis)
AuxY    Other, auxiliary sentence elements
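The *_Co, *_Ap and *_Pa entries in Table 4 are suffixes attached to another relation, so a DEPREL value such as Sb_Co combines a base relation with a marker. A minimal sketch for separating the two follows. The helper name split_deprel is hypothetical, and whether more than one suffix can appear on the same label is not stated in this documentation; the sketch simply accepts any number of known suffixes.

```python
from typing import List, Tuple

# Suffixes listed in Table 4 as *_Co, *_Ap and *_Pa.
SUFFIXES = {"Co", "Ap", "Pa"}


def split_deprel(deprel: str) -> Tuple[str, List[str]]:
    """Split a DEPREL label into its base relation and its suffixes."""
    parts = deprel.split("_")
    base, suffixes = parts[0], parts[1:]
    unknown = [s for s in suffixes if s not in SUFFIXES]
    if unknown:
        raise ValueError(f"unexpected suffix in {deprel!r}: {unknown}")
    return base, suffixes
```

For instance, split_deprel("Sb_Co") returns ("Sb", ["Co"]), that is, a subject that is a member of a coordination.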

2.3. Text

The text material consists of texts, or extracts of texts, that were collected in the framework of national and EU-funded research projects aiming at multilingual, multimedia information extraction. The main domains covered are politics (current affairs and manual transcripts of European parliamentary sessions), health, and travel.


2.4. Statistics

Table 5. GDT Statistics

Sentences           2902
Tokens              70223
Tokens (non-punct)  63072
Types (non-punct)   13295
Lemmas              7017
Coarse POS Tags     18
Fine POS Tags       38
DepRels             45
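Counts of the kind shown in Table 5 can be approximated from the data files along the following lines. This is only a sketch under stated assumptions: the documentation does not say how punctuation, types or lemmas were counted, so the example assumes that punctuation tokens are those with CPOSTAG equal to PUNCT, that types are distinct FORM values, and that lemmas are counted over non-punctuation tokens. The function name corpus_stats and the toy sentence are made up.

```python
from typing import Dict, List

Token = Dict[str, str]


def corpus_stats(sentences: List[List[Token]]) -> Dict[str, int]:
    """Count sentences, tokens and distinct values of selected fields."""
    tokens = [tok for sent in sentences for tok in sent]
    # Assumption: punctuation is identified by the coarse tag PUNCT.
    words = [tok for tok in tokens if tok["CPOSTAG"] != "PUNCT"]
    return {
        "sentences": len(sentences),
        "tokens": len(tokens),
        "tokens_non_punct": len(words),
        "types_non_punct": len({tok["FORM"] for tok in words}),
        "lemmas": len({tok["LEMMA"] for tok in words}),
        "coarse_pos_tags": len({tok["CPOSTAG"] for tok in tokens}),
        "fine_pos_tags": len({tok["POSTAG"] for tok in tokens}),
        "deprels": len({tok["DEPREL"] for tok in tokens}),
    }


# Toy one-sentence corpus (made up, not taken from the GDT).
toy = [[
    {"FORM": "a", "LEMMA": "a", "CPOSTAG": "No", "POSTAG": "NoCm", "DEPREL": "Sb"},
    {"FORM": "b", "LEMMA": "b", "CPOSTAG": "Vb", "POSTAG": "VbMn", "DEPREL": "Pred"},
    {"FORM": ".", "LEMMA": ".", "CPOSTAG": "PUNCT", "POSTAG": "PUNCT", "DEPREL": "AuxK"},
]]
stats = corpus_stats(toy)
```

Figures obtained this way may differ from Table 5 if the original counts relied on different definitions.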

2.5. Conversion

The original annotation files (stored in the PDT *fs format) were converted into the CoNLL tab format by Prokopis Prokopidis. The sentences of all files were then randomly shuffled to generate the final training and testing data delivered to the CoNLL 2007 shared task organizers and participants.
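The shuffling step can be illustrated as follows. The actual procedure, random seed and size of the test set are not documented here, so the function name shuffle_and_split, the seed and the test_size parameter are all assumptions of this sketch.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

S = TypeVar("S")


def shuffle_and_split(sentences: Sequence[S], test_size: int,
                      seed: int = 0) -> Tuple[List[S], List[S]]:
    """Shuffle sentences reproducibly and hold out the last test_size."""
    if not 0 <= test_size <= len(sentences):
        raise ValueError("test_size must be between 0 and the corpus size")
    shuffled = list(sentences)  # copy, so the input order is left untouched
    random.Random(seed).shuffle(shuffled)
    split = len(shuffled) - test_size
    return shuffled[:split], shuffled[split:]
```

Shuffling at the sentence level means that sentences from the same source text can end up in both the training and the testing portion.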


3. Acknowledgments

The General Directorate of the Technoglossia postgraduate programme of studies.

Research Project Multimedia Content Management Systems (MUSE), E-Business 84. General Secretariat for Research and Technology of the Greek Ministry of Development.

Research Project Retrieval of Video and Language for The Home user in an Information Society (Reveal This), FP6-IST-511689.

Students Antonopoulou Fotoula, Vlachou Argiro, Dimopoulou Maria, Drakopoulou Stella, Zourari Maria, Ilikevich Yuliya, Carayannidi Aphroditi, Carra Vasiliki, Kefalas Athanasios, Kioultzidi Lambrini, Lada Maria, Mamouzelou Anthi, Marzelou Euridiki, Mitrakou Harriet, Morfopoulou Vasiliki, Badavanou Sofia, Papagiannopoulou Aggeliki, Paschou Eustratia, Redoumi Vasiliki, Roditi Ioanna, Sakellaropoulou Theofani, Touribaba Aglaia, Tsagogeorga Dimitra, Tsarouchas Dimitrios, Theologou Maria, Antonopoulos Theodoros, Fakou Aikaterini, Nikta Marina, Gakis Dimitrios and Aggelou Epaminodas, for their ideas & annotation work during the course.

The Prague Dependency Treebank project for making available excellent open source tools for annotation and conversion of dependency trees; the annotation schema for the GDT was based on the original schema provided by the PDT.

Jens Nilsson for providing help in all issues concerning the shared task.


A. 2007 CoNLL Evaluation Agreement


---------------------------------------------------------------------
2007 CoNLL Evaluation Agreement

In the remainder of this document the term User refers to:

______________________________________ (Individual name)

and the term User's research group refers to:

_______________________________________ (University, Institute or Company name)

_______________________________________ (Specific department or area, if appropriate).

This letter describes the terms of an agreement between User and the
Institute for Language and Speech Processing (ILSP), in which User
will receive material as specified below.

Under this agreement, User will receive by email or ftp a copy of the
Greek Dependency Treebank (GDT) converted in a format suitable for the
2007 CoNLL shared task on dependency parsing. User agrees to use the
material received under this evaluation, and any resources derivative
from this material (e.g. parts of the GDT, statistical models based on
the GDT, modified versions of the GDT), only for the purposes of the
2007 CoNLL shared task.  After participation has ended, User agrees to
delete the GDT copy from any computer or media onto which it has been
copied.  User further agrees to delete any GDT derivatives that were
created during the 2007 CoNLL shared task. User further agrees not to
disclose, copy or redistribute the GDT or any of its derivatives to
others outside of the User's research group.

User agrees that the Institute for Language and Speech Processing does
not warrant the accuracy, completeness, currentness, merchantability
or fitness for a particular purpose of the information contained in
the GDT. In no event will the Institute for Language and Speech
Processing be liable to any authorized user, or anyone else for any
loss or injury caused in whole or in part by its negligence or
contingencies beyond its control in procuring, compiling,
interpreting, editing, writing, reporting or delivering the
information, or any errors, omissions or inaccuracies in the
information, regardless of how caused. In no event will the Institute
for Language and Speech Processing be liable to the organization, any
authorized user or anyone else for any decision made or action taken
by the organization or any authorized user in reliance upon any part
of the information or for any consequential, direct, special or
similar damages, even if advised of the possibility of such damages.

Corpora and/or Data Received:

CoNLL-2007 Shared Task Datasets (GDT)

Organization: ___________________________________________

Name: ___________________________________________________

Signature: ______________________________________________

Date: ___________________________________________________

E-mail (required): ______________________________________


For ILSP:

Stelios Piperidis
Head of the Department of Language Technology Applications
Institute for Language and Speech Processing
Artemidos 6 & Epidavrou
GR-151 25 Maroussi
Greece


B. Features

Table B-1. Ad (Adverb)

Degree
Ba (Basic)
Cp (Comparative)
Su (Superlative)

Table B-2. Aj (Adjective)

Degree            Gender          Number         Case
Ba (Basic)        Ma (Masculine)  Sg (Singular)  Nm (Nominative)
Cp (Comparative)  Fe (Feminine)   Pl (Plural)    Ge (Genitive)
Su (Superlative)  Ne (Neuter)                    Ac (Accusative)
                                                 Da (Dative)
                                                 Vo (Vocative)

Table B-3. AsPpPa (Preposition + Article combination)

Gender          Number         Case
Ma (Masculine)  Sg (Singular)  Ac (Accusative)
Fe (Feminine)   Pl (Plural)    Ge (Genitive)
Ne (Neuter)

Table B-4. At (Article)

Gender          Number         Case
Ma (Masculine)  Sg (Singular)  Nm (Nominative)
Fe (Feminine)   Pl (Plural)    Ge (Genitive)
Ne (Neuter)                    Ac (Accusative)
                               Da (Dative)

Table B-5. Nm (Numeral)

Gender          Number         Case             Function
Ma (Masculine)  Sg (Singular)  Nm (Nominative)  Aj (Adjectival)
Fe (Feminine)   Pl (Plural)    Ge (Genitive)    No (Nominal)
Ne (Neuter)                    Ac (Accusative)
                               Da (Dative)
                               Vo (Vocative)

Table B-6. No (Noun)

Gender          Number         Case
Ma (Masculine)  Sg (Singular)  Nm (Nominative)
Fe (Feminine)   Pl (Plural)    Ge (Genitive)
Ne (Neuter)                    Ac (Accusative)
                               Da (Dative)
                               Vo (Vocative)

Table B-7. Pn (Pronoun)

Gender          Person  Number         Case             Inflection
Ma (Masculine)  01      Sg (Singular)  Nm (Nominative)  We (Weak)
Fe (Feminine)   02      Pl (Plural)    Ge (Genitive)    St (Strong)
Ne (Neuter)     03                     Ac (Accusative)  Xx (No Value)
                                       Da (Dative)
                                       Vo (Vocative)

Table B-8. Vb (Verb)

Finiteness/Mood  Tense          Person         Number         Gender          Aspect             Voice         Case
Id (Indicative)  Pr (Present)   01             Sg (Singular)  Ma (Masculine)  Ip (Imperfective)  Av (Active)   Nm (Nominative)
Mp (Imperative)  Pa (Past)      02             Pl (Plural)    Fe (Feminine)   Pe (Perfective)    Pv (Passive)  Ge (Genitive)
Nf (Infinitive)  Xx (No Value)  03             Xx (No Value)  Ne (Neuter)                                      Ac (Accusative)
Pp (Participle)                 Xx (No Value)                 Xx (No Value)                                    Da (Dative)
                                                                                                               Vo (Vocative)
                                                                                                               Xx (No Value)
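As an illustration of how the tables in this appendix might be used, the sketch below decodes the FEATS field of a noun with the values from Table B-6. Two assumptions are made that this documentation does not confirm: that feature values in FEATS are separated by a vertical bar, as in the general CoNLL format, and that an underscore marks an empty field. The helper name decode_noun_feats is hypothetical. Note that the same lookup-by-code approach would not work unchanged for verbs, where Xx (No Value) appears under several attributes.

```python
from typing import Dict

# Feature values for nouns, copied from Table B-6.
NOUN_FEATURES = {
    "Gender": {"Ma": "Masculine", "Fe": "Feminine", "Ne": "Neuter"},
    "Number": {"Sg": "Singular", "Pl": "Plural"},
    "Case": {"Nm": "Nominative", "Ge": "Genitive", "Ac": "Accusative",
             "Da": "Dative", "Vo": "Vocative"},
}


def decode_noun_feats(feats: str, sep: str = "|") -> Dict[str, str]:
    """Map each noun feature code in feats to its attribute and value."""
    if feats == "_":
        return {}
    decoded: Dict[str, str] = {}
    for code in feats.split(sep):
        for attribute, values in NOUN_FEATURES.items():
            if code in values:
                decoded[attribute] = values[code]
                break
        else:
            # No attribute in Table B-6 lists this code.
            raise ValueError(f"unknown noun feature code: {code!r}")
    return decoded
```

Under these assumptions, decode_noun_feats("Fe|Sg|Nm") yields a feminine, singular, nominative reading.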