2010 Update Note: The dataset contained in this distribution is the Greek Dependency Treebank as it was converted and released for the purposes of the CoNLL 2007 Shared Task. After the addition of new annotated material in 2009-2010, GDT currently contains approximately 100K tokens with texts from Greek Wikipedia articles, manually normalized transcripts of European parliamentary sessions, and web documents pertaining to the politics, health, and travel domains. For updates, please visit the GDT site, http://gdt.ilsp.gr. The following documentation corresponds to the 2007 edition.
The Greek Dependency Treebank (GDT) will become available from the CoNLL 2007 shared task organizers, for the purposes of the CoNLL 2007 shared task only.
The Greek Dependency Treebank is copyrighted material.
* (c) 2005-2007, by the Institute for Language and Speech Processing. ILSP owns the copyright to all automatic and manually-validated annotations in the GDT.
The GDT corpus was collected by ILSP researchers in the framework of national and EU-funded research projects aiming at multilingual, multimedia information extraction. GDT consists of randomly selected textual fragments and texts in three domains: politics (current affairs, manual transcripts and minutes of European parliamentary sessions), health, and travel. The copyright to these textual data belongs to the authors of the original texts. The dependency annotations in the GDT were carried out by students of the postgraduate programme Technoglossia IV, organised by the Institute for Language and Speech Processing, the University of Athens and the National Technical University of Athens, following a postgraduate course in Corpus-based Linguistics taught by Haris Papageorgiou in 2005 with the assistance of Prokopis Prokopidis, Elina Desypri and Maria Koutsombogera. The initial dependency annotations were validated by the ILSP researchers Prokopis Prokopidis, Elina Desypri and Maria Koutsombogera.
The copyright owner of the Greek Dependency Treebank (Institute for Language and Speech Processing) grants the CoNLL 2007 shared task organizers and participants the right to use the Greek Dependency Treebank under the terms of the license agreement in Appendix A.
Prokopis Prokopidis, Elina Desypri, Maria Koutsombogera, Haris Papageorgiou, and Stelios Piperidis. Theoretical and Practical Issues in the Construction of a Greek Dependency Treebank. In Montserrat Civit, Sandra Kübler, and Ma. Antònia Martí, editors, Proceedings of The Fourth Workshop on Treebanks and Linguistic Theories (TLT 2005), pages 149-160, Barcelona, Spain, December 2005. Universitat de Barcelona.
Data for the CoNLL 2007 shared task adheres to the following rules:
Data files contain one or more sentences separated by a blank line.
A sentence consists of one or more tokens, each one starting on a new line.
A token consists of ten fields described in Table 1. Fields are separated by one tab character.
All data files will contain these ten fields, although only the ID, FORM, CPOSTAG, POSTAG, HEAD and DEPREL columns are guaranteed to contain non-underscore values for all languages.
Data files are UTF-8 encoded.
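The rules above can be sketched as a small reader. The following Python fragment is only an illustration and is not part of the GDT distribution; the function name read_sentences is an assumption made for this example. It accepts any iterable of lines, for instance a data file opened with encoding="utf-8".

```python
from typing import Iterable, Iterator, List


def read_sentences(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield sentences as lists of tokens; each token is a list of ten strings.

    Illustrative sketch only: sentences are separated by blank lines,
    tokens sit on separate lines, and fields are separated by one tab.
    """
    sentence: List[List[str]] = []
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            raise ValueError(
                f"line {line_number}: expected 10 tab-separated fields, "
                f"found {len(fields)}"
            )
        sentence.append(fields)
    # The last sentence may not be followed by a blank line.
    if sentence:
        yield sentence
```

Raising an error on a wrong field count is a design choice of this sketch; a more lenient reader could skip or log malformed lines instead.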
Table 1. Description of the ten token fields
Field | Name | Description |
---|---|---|
1 | ID | Token counter, starting at 1 for each new sentence |
2 | FORM | Word form or punctuation symbol |
3 | LEMMA | Lemma of word form. In the case of GDT, lemmas were automatically retrieved from ILSP's Greek morphological lexicon, but they were not manually corrected. |
4 | CPOSTAG | Coarse-grained part-of-speech tag. In the case of GDT, CPOS tags were automatically assigned by a tagger developed at ILSP, but they were not manually corrected. For a description of all possible CPOS tags, see Table 2. |
5 | POSTAG | Fine-grained part-of-speech tag. In the case of GDT, POS tags were automatically assigned by a tagger developed at ILSP, but they were not manually corrected. For a description of all possible fine-grained POS tags, see Table 3. |
6 | FEATS | List of set-valued syntactic and/or morphological features. In the case of GDT, these features were automatically assigned by a tagger developed at ILSP, but they were not manually corrected. For a description of all possible features, see Appendix B. |
7 | HEAD | Non-projective head of current token, which is either a value of ID or zero ('0'). |
8 | DEPREL | Dependency relation to the non-projective head. The GDT annotation schema is not currently available in English; for a short description of all possible dependency relations, see Table 4. |
9 | PHEAD | Projective head of current token. Always an underscore, since this information is not available in the GDT. |
10 | PDEPREL | Dependency relation to projective head. Always an underscore, since this information is not available in the GDT. |
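As an illustration of how the fields in Table 1 might be consumed, the Python sketch below turns one row of ten strings into a record and groups token IDs under their heads. The names Token, make_token and dependents_by_head are assumptions made for this example and are not part of the distribution. Key 0 in the resulting mapping collects the tokens whose HEAD is zero.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    """One token line; attribute names mirror the field names in Table 1."""
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: int
    deprel: str
    phead: str    # always "_" in the GDT
    pdeprel: str  # always "_" in the GDT


def make_token(fields: List[str]) -> Token:
    """Build a Token from ten field strings, converting ID and HEAD to int."""
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, found {len(fields)}")
    return Token(int(fields[0]), fields[1], fields[2], fields[3], fields[4],
                 fields[5], int(fields[6]), fields[7], fields[8], fields[9])


def dependents_by_head(sentence: List[Token]) -> Dict[int, List[int]]:
    """Map each HEAD value to the IDs of the tokens attached to it."""
    ids = {token.id for token in sentence}
    dependents: Dict[int, List[int]] = defaultdict(list)
    for token in sentence:
        # HEAD must be zero or the ID of a token in the same sentence.
        if token.head != 0 and token.head not in ids:
            raise ValueError(f"token {token.id}: HEAD {token.head} not found")
        dependents[token.head].append(token.id)
    return dict(dependents)
```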
Table 2. Coarse-grained POS tags
Tag | Description |
---|---|
Ad | Adverb |
Aj | Adjective |
AsPp | Preposition |
At | Article |
Cj | Conjunction |
COMP | A composite word form |
DATE | Date |
DIG | Digit |
ENUM | Enumeration element |
INIT | Initial |
LSPLIT | A word with a letter or syllable omitted at the end of the wordform (apocope) |
Nm | Numeral |
No | Noun |
Pn | Pronoun |
Pt | Particle |
PUNCT | Punctuation symbol |
Rg | Residual |
Vb | Verb |
Table 3. Fine-grained POS tags
Tag | Description |
---|---|
Ad | Adverb |
Aj | Adjective |
AsPpPa | Preposition + Article combination |
AsPpSp | Simple preposition |
AtDf | Definite article |
AtId | Indefinite article |
CjCo | Coordinating conjunction |
CjSb | Subordinating conjunction |
COMP | A composite word form |
DATE | Date |
DIG | Digit |
ENUM | Enumeration element |
INIT | Initial |
LSPLIT | A word with a letter or syllable omitted at the end of the wordform (apocope) |
NmCd | Cardinal numeral |
NmCt | Collective numeral |
NmMl | Multiplicative numeral |
NmOd | Ordinal numeral |
NoCm | Common noun |
NoPr | Proper noun |
PnDm | Demonstrative pronoun |
PnId | Indefinite pronoun |
PnIr | Interrogative pronoun |
PnPe | Personal pronoun |
PnPo | Possessive pronoun |
PnRe | Relative pronoun |
PnRi | Relative indefinite pronoun |
PtFu | Future particle |
PtNe | Negative particle |
PtOt | Other particle |
PtSj | Subjunctive particle |
PUNCT | Punctuation symbol |
RgAbXx | Abbreviation |
RgAnXx | Acronym |
RgFwOr | Foreign word in its original form |
RgFwTr | Transliterated foreign word |
VbIs | Impersonal verb |
VbMn | Main verb |
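Comparing Tables 2 and 3 suggests that every fine-grained tag begins with its coarse-grained tag (for example AsPpPa and AsPpSp begin with AsPp, and RgAbXx begins with Rg). This is an observation drawn from the two tables, not a documented rule of the GDT. Under that assumption, the Python sketch below recovers the coarse-grained tag by longest-prefix match; the name coarse_from_fine is hypothetical.

```python
# Coarse-grained tags as listed in Table 2.
COARSE_TAGS = (
    "Ad", "Aj", "AsPp", "At", "Cj", "COMP", "DATE", "DIG", "ENUM",
    "INIT", "LSPLIT", "Nm", "No", "Pn", "Pt", "PUNCT", "Rg", "Vb",
)


def coarse_from_fine(postag: str) -> str:
    """Return the longest tag from Table 2 that is a prefix of postag.

    Assumption: fine-grained tags start with their coarse-grained tag.
    The function does not check that postag itself appears in Table 3.
    """
    for coarse in sorted(COARSE_TAGS, key=len, reverse=True):
        if postag.startswith(coarse):
            return coarse
    raise ValueError(f"no tag from Table 2 is a prefix of {postag!r}")
```

Since the data files already carry both CPOSTAG and POSTAG, a function like this is mainly useful as a consistency check between the two columns.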
Table 4. Dependency Relations
Afun | Description |
---|---|
Pred | Main sentence predicate |
Sb | Subject |
Obj | Direct object |
IObj | Indirect object |
Pnom | Predicative dependent |
Adv | Adverbial dependent |
Atv | Adverbial predicative dependent |
Atr | Attribute |
AuxP | Prepositional node |
AuxC | Conjunction node |
Coord | A node governing coordination |
Apos | A node governing apposition |
*_Co | A node governed by a Coord |
*_Ap | A node governed by an Apos |
*_Pa | Head node of a parenthetical structure |
AuxX | Comma |
AuxV | Auxiliary node attached to a verb |
AuxK | Terminal punctuation |
AuxG | Auxiliary punctuation |
ExD | A node whose real parent node is not present in the sentence (ellipsis) |
AuxY | Other, auxiliary sentence elements |
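The *_Co, *_Ap and *_Pa entries in Table 4 describe suffixes attached to another relation label, so that a label such as Sb_Co would denote a subject governed by a Coord node. The Python sketch below separates the base relation from its suffixes. The name split_deprel is hypothetical, and whether more than one suffix can appear on a single label in the data is an assumption that this sketch merely tolerates.

```python
from typing import Tuple

# Suffixes listed in Table 4 as *_Co, *_Ap and *_Pa.
SUFFIXES = {"Co", "Ap", "Pa"}


def split_deprel(deprel: str) -> Tuple[str, Tuple[str, ...]]:
    """Split a DEPREL value into (base relation, tuple of suffixes).

    Assumption: suffixes are joined to the base label with underscores,
    and the base labels themselves contain no underscore.
    """
    base, *suffixes = deprel.split("_")
    unknown = [s for s in suffixes if s not in SUFFIXES]
    if not base or unknown:
        raise ValueError(f"unexpected DEPREL value: {deprel!r}")
    return base, tuple(suffixes)
```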
The text material consists of texts, or extracts of texts, that were collected in the framework of national and EU-funded research projects aiming at multilingual, multimedia information extraction. The main domains covered are politics (current affairs and manual transcripts of European parliamentary sessions), health, and travel.
The conversion of the original annotation files (stored in the PDT *fs format) into the CoNLL tab format was made by Prokopis Prokopidis. Sentences of all files were randomly shuffled to generate the final training and testing data delivered to the CoNLL 2007 shared task organizers and participants.
The General Directorate of the Technoglossia postgraduate programme of studies.
Research Project Multimedia Content Management Systems (MUSE), E-Business 84. General Secretariat for Research and Technology of the Greek Ministry of Development.
Research Project Retrieval of Video and Language for The Home user in an Information Society (Reveal This), FP6-IST-511689.
Students Antonopoulou Fotoula, Vlachou Argiro, Dimopoulou Maria, Drakopoulou Stella, Zourari Maria, Ilikevich Yuliya, Carayannidi Aphroditi, Carra Vasiliki, Kefalas Athanasios, Kioultzidi Lambrini, Lada Maria, Mamouzelou Anthi, Marzelou Euridiki, Mitrakou Harriet, Morfopoulou Vasiliki, Badavanou Sofia, Papagiannopoulou Aggeliki, Paschou Eustratia, Redoumi Vasiliki, Roditi Ioanna, Sakellaropoulou Theofani, Touribaba Aglaia, Tsagogeorga Dimitra, Tsarouchas Dimitrios, Theologou Maria, Antonopoulos Theodoros, Fakou Aikaterini, Nikta Marina, Gakis Dimitrios and Aggelou Epaminodas, for their ideas & annotation work during the course.
The Prague Dependency Treebank project for making available excellent open source tools for annotation and conversion of dependency trees; the annotation schema for the GDT was based on the original schema provided by the PDT.
Jens Nilsson for providing help in all issues concerning the shared task.
---------------------------------------------------------------------
2007 CoNLL Evaluation Agreement
In the remainder of this document the term User refers to:
______________________________________ (Individual name)
and the term User's research group refers to:
_______________________________________ (University, Institute or Company name)
_______________________________________ (Specific department or area, if appropriate).
This letter describes the terms of an agreement between User and the
Institute for Language and Speech Processing (ILSP), in which User
will receive material as specified below.
Under this agreement, User will receive by email or ftp a copy of the
Greek Dependency Treebank (GDT) converted in a format suitable for the
2007 CoNLL shared task on dependency parsing. User agrees to use the
material received under this evaluation, and any resources derivative
from this material (e.g. parts of the GDT, statistical models based on
the GDT, modified versions of the GDT), only for the purposes of the
2007 CoNLL shared task. After participation has ended, User agrees to
delete the GDT copy from any computer or media onto which it has been
copied. User further agrees to delete any GDT derivatives that were
created during the 2007 CoNLL shared task. User further agrees not to
disclose, copy or redistribute the GDT or any of its derivatives to
others outside of the User's research group.
User agrees that the Institute for Language and Speech Processing does
not warrant the accuracy, completeness, currentness, merchantability
or fitness for a particular purpose of the information contained in
the GDT. In no event will the Institute for Language and Speech
Processing be liable to any authorized user, or anyone else for any
loss or injury caused in whole or in part by its negligence or
contingencies beyond its control in procuring, compiling,
interpreting, editing, writing, reporting or delivering the
information, or any errors, omissions or inaccuracies in the
information, regardless of how caused. In no event will the Institute
for Language and Speech Processing be liable to the organization, any
authorized user or anyone else for any decision made or action taken
by the organization or any authorized user in reliance upon any part
of the information or for any consequential, direct, special or
similar damages, even if advised of the possibility of such damages.
Corpora and/or Data Received:
CoNLL-2007 Shared Task Datasets (GDT)
Organization: ___________________________________________
Name: ___________________________________________________
Signature: ______________________________________________
Date: ___________________________________________________
E-mail (required): ______________________________________
For ILSP:
Stelios Piperidis
Head of the Department of Language Technology Applications
Institute for Language and Speech Processing
Artemidos 6 & Epidavrou
GR-151 25 Maroussi
Greece
Table B-2. Aj (Adjective)
Feature | Values |
---|---|
Degree | Ba (Basic), Cp (Comparative), Su (Superlative) |
Gender | Ma (Masculine), Fe (Feminine), Ne (Neuter) |
Number | Sg (Singular), Pl (Plural) |
Case | Nm (Nominative), Ge (Genitive), Ac (Accusative), Da (Dative), Vo (Vocative) |
Table B-3. AsPpPa (Preposition + Article combination)
Feature | Values |
---|---|
Gender | Ma (Masculine), Fe (Feminine), Ne (Neuter) |
Number | Sg (Singular), Pl (Plural) |
Case | Ac (Accusative), Ge (Genitive) |
Table B-4. At (Article)
Feature | Values |
---|---|
Gender | Ma (Masculine), Fe (Feminine), Ne (Neuter) |
Number | Sg (Singular), Pl (Plural) |
Case | Nm (Nominative), Ge (Genitive), Ac (Accusative), Da (Dative) |
Table B-5. Nm (Numeral)
Feature | Values |
---|---|
Gender | Ma (Masculine), Fe (Feminine), Ne (Neuter) |
Number | Sg (Singular), Pl (Plural) |
Case | Nm (Nominative), Ge (Genitive), Ac (Accusative), Da (Dative), Vo (Vocative) |
Function | Aj (Adjectival), No (Nominal) |
Table B-6. No (Noun)
Feature | Values |
---|---|
Gender | Ma (Masculine), Fe (Feminine), Ne (Neuter) |
Number | Sg (Singular), Pl (Plural) |
Case | Nm (Nominative), Ge (Genitive), Ac (Accusative), Da (Dative), Vo (Vocative) |
Table B-7. Pn (Pronoun)
Feature | Values |
---|---|
Gender | Ma (Masculine), Fe (Feminine), Ne (Neuter) |
Person | 01, 02, 03 |
Number | Sg (Singular), Pl (Plural) |
Case | Nm (Nominative), Ge (Genitive), Ac (Accusative), Da (Dative), Vo (Vocative) |
Inflection | We (Weak), St (Strong), Xx (No Value) |
Table B-8. Vb (Verb)
Feature | Values |
---|---|
Finiteness/Mood | Id (Indicative), Mp (Imperative), Nf (Infinitive), Pp (Participle) |
Tense | Pr (Present), Pa (Past), Xx (No Value) |
Person | 01, 02, 03, Xx (No Value) |
Number | Sg (Singular), Pl (Plural), Xx (No Value) |
Gender | Ma (Masculine), Fe (Feminine), Ne (Neuter), Xx (No Value) |
Aspect | Ip (Imperfective), Pe (Perfective) |
Voice | Av (Active), Pv (Passive) |
Case | Nm (Nominative), Ge (Genitive), Ac (Accusative), Da (Dative), Vo (Vocative), Xx (No Value) |