This is the readme file for ISST-CoNLL, the Italian part of the CoNLL-2007
Shared Task.

Version: $Id: README,v 1.0 2007/26/01 $


1. Preamble

1.1 Source

The Italian dependency annotated corpus, developed for the CoNLL-2007
Shared Task, was derived from the Italian Syntactic-Semantic Treebank
(ISST), a multi-layered annotated corpus of Italian which represents one
of the main outcomes of a major Italian national project, SI-TAL. ISST was
developed between 1999 and 2001. The project consortium included:
ILC-CNR/CPR, Venezia University/CVR, ITC-IRST, "Tor Vergata"
University/CERTIA and Synthema.

The ISST corpus consists of 305,547 word tokens reflecting contemporary
language use. It includes two different sections:

1) a "balanced" corpus, for a total of 215,606 tokens (from the newspapers
   La Repubblica and Il Corriere della Sera, and from a number of different
   periodicals, years 1985-1995);
2) a specialised corpus, amounting to 89,941 tokens, with texts belonging
   to the financial domain (from Il Sole-24Ore, 1994).

The ISST-CoNLL corpus was developed as a joint effort of ILC-CNR
(Simonetta Montemagni) and the Dipartimento di Informatica of the
University of Pisa (Maria Simi).

1.2 Copyright and license

ISST-CoNLL is copyrighted material which can be used for research purposes
only and which cannot be distributed in any original or modified form (see
the licence agreement form).

1.3 Web site

For more information: .


2. Documentation

2.1 Data format

Data adheres to the following rules:

* Data files contain one or more sentences separated by a blank line.
* A sentence consists of one or more tokens, each one starting on a new
  line.
* A token consists of ten fields described in the table below. Fields are
  separated by one tab character.
* All data files contain these ten fields, although only the ID, FORM,
  CPOSTAG, POSTAG, FEATS, HEAD and DEPREL columns are used.
* Data files are UTF-8 encoded (Unicode).

Field 1: ID
    Token counter, starting at 1 for each new sentence.

Field 2: FORM
    Word form or punctuation symbol

Field 3: LEMMA
    Lemma of word form

Field 4: CPOSTAG
    Coarse-grained part-of-speech tag. Based on the ILC/PAROLE tagset and
    conformant to the EAGLES international standard.

    --------------------------------
    Value Description
    --------------------------------
    A:    adjective
    B:    adverb
    C:    conjunction
    D:    determiner
    E:    preposition
    I:    interjection
    N:    numeral
    P:    pronoun
    PU:   punctuation
    R:    article
    S:    noun
    SA:   abbreviation
    V:    verb
    X:    residual class
    --------------------------------

Field 5: POSTAG
    Fine-grained part-of-speech tag. Based on the ILC/PAROLE tagset and
    conformant to the EAGLES international standard. For additional
    examples and contexts of use, see the documentation at .

    ------------------------------------------------------------------
    Value Description                Examples
    ------------------------------------------------------------------
    A:    adjective                  bello, buono, pauroso, ottimo
    AP:   possessive adjective       mio, tuo, nostro, loro
    B:    adverb                     bene, fortemente, malissimo, domani
    C:    conjunction                e, o, ma, mentre, quando
    DD:   demonstrative determiner   questo, codesto, quello
    DE:   exclamative determiner     che, quale, quanto
    DI:   indefinite determiner      alcuno, certo, tale, parecchio, qualsiasi
    DR:   relative determiner        cui, quale
    DT:   interrogative determiner   che, quale, quanto
    E:    preposition                di, a, da, in, su, verso, prima_di
    I:    interjection               ahime', beh, ecco, grazie
    N:    cardinal numbers           uno, due, cento, mille, 28, 2000
    NO:   ordinal numbers            primo, secondo, centesimo
    PD:   demonstrative pronoun      questo, quello, costui
    PI:   indefinite pronoun         chiunque, ognuno, molto
    PP:   possessive pronoun         mio, tuo, suo, loro, proprio
    PQ:   personal pronoun           io, egli, noi, lo, la, mi, ci, vi
    PR:   relative pronoun           che, cui, quale
    PT:   interrogative pronoun      che, chi, quanto
    PU:   punctuation                , ; . ? !
    RD:   determinative article      il, lo, la, i, gli, le
    RI:   indeterminative article    uno, un, una
    S:    common noun                amico, insegnante, verita'
    SA:   abbreviation               ndr, a.C., d.o.c., km
    SP:   proper noun                Monica, Pisa, Fiat, Sardegna
    SW:   foreign noun               fazenda, mulieris dignitatem, weekend
    V:    verb                       mangio, avere, passato, camminando
    X:    residual class (formulas,  43'', piacce
          unclassified words, ...)
    ------------------------------------------------------------------

Field 6: FEATS
    This field contains an unordered set of morpho-syntactic features
    complementing the part-of-speech information. The tables which follow
    document:

    1. the association between morpho-syntactic features and part-of-speech
       information (first table); features marked between square brackets
       are optionally specified;
    2. the typology of features and their possible values (second table).

    For additional examples and contexts of use, see the documentation at .

    -------------------------------------------------------------
    POS classes               Features
    -------------------------------------------------------------
    A                         gen, num, [sup]
    AP                        gen, num
    B                         [sup]
    E                         [gen], [num]
    DD, DE, DI, DR, DT        gen, num
    N, NO                     [gen], [num]
    P, PD, PI, PP, PR, PT     gen, num
    PQ                        gen, num, per
    RD, RI                    gen, num
    SA                        gen, num
    S, SP, SW                 gen, num
    V                         [gen], [num], [per], mod, [tmp]
    -------------------------------------------------------------

    -----------------------------------------------------------------------
    Feature   Value   Description       Examples
    -----------------------------------------------------------------------
    gen       M       masculine         caso (S)
              F       feminine          giustizia (S)
              N       underspecified    ospite (S)
    -----------------------------------------------------------------------
    num       S       singular          casa (S)
              P       plural            donne (S)
              N       underspecified    le (PQ)
    -----------------------------------------------------------------------
    mod       G       gerundive         dedicando
              F       infinitive        essere
              I       indicative        trovava
              C       subjunctive       possegga
              D       conditional       sarebbero
              M       imperative        cercate
              P       participle        rapita
    -----------------------------------------------------------------------
    per       1       first person      possiamo (V)
              2       second person     sapete (V)
              3       third person      le (PQ), vede (V)
    -----------------------------------------------------------------------
    tmp       P       present tense     ha
              F       future tense      sara'
              I       imperfect tense   trovava
              R       past tense        rapita
    -----------------------------------------------------------------------
    sup       S       superlative       gravissimi (A), benissimo (B)
    -----------------------------------------------------------------------

Field 7: HEAD
    Non-projective head of current token, which is either a value of ID or
    zero ('0')

Field 8: DEPREL
    Dependency relation to the non-projective head, which is 'ROOT' when
    the value of HEAD is zero. The table which follows documents the 21
    dependency tags used for corpus annotation. For examples and contexts
    of use see the documentation at .

    ------------------------------------------------------------------------
    Value      Relation type
               Description
    ------------------------------------------------------------------------
    arg:       argument
               The most generic relation between a head and a
               subcategorized argument. Also used between a verbal head
               and a non-subject clausal argument.
               Ex. Le autorita' hanno il blitz e' concluso.
    ------------------------------------------------------------------------
    aux:       auxiliary
               The relation between a verb and its auxiliary.
               Ex. Il corazziere e'
    ------------------------------------------------------------------------
    clit:      clitic
               The relation between a clitic pronoun and a verbal head in
               pronominal form.
               Ex. La sedia rotta
    ------------------------------------------------------------------------
    comp:      complement
               The most generic relation between a head and a complement.
               For instance:
               a) the relation between a head and the agent in the passive
                  construction
                  Ex. Fu un pazzo
               b) the relation between the compared item and the
                  comparative complement in comparative constructions.
                  Ex. E' piu' libro
    ------------------------------------------------------------------------
    con:       copulative conjunction
               In coordinate structures, holds between the first conjunct,
               which is taken to be the head of the whole structure, and a
               copulative conjunction.
               Ex. Una ragazza sequestrata da due slavi
    ------------------------------------------------------------------------
    concat:    concatenation
               Holds between tokens forming complex word forms (e.g.
               complex proper nouns, multi-word expressions and the like).
               Ex. Il segretario di
    ------------------------------------------------------------------------
    cong:      conjunct
               Links the (second, third, ...) conjuncts to the first
               conjunct. Used in association with 'con'.
               Ex. Una ragazza e da due slavi
    ------------------------------------------------------------------------
    cong_sub:  subordinate conjunct
               Holds between a subordinative conjunction and the verbal
               head of its clausal complement.
               Ex. Ha detto non fare nulla
    ------------------------------------------------------------------------
    det:       determiner
               The relation holding between a nominal head and its
               determiner.
               Ex. Rilevata di gas
    ------------------------------------------------------------------------
    dis:       disjunctive conjunction
               Holds between a disjunctive conjunction in coordinate
               structures and the first conjunct, which is taken to be the
               head of the whole coordinate structure.
               Ex. Cassonetti dell'immondizia incendiati
    ------------------------------------------------------------------------
    disg:      disjunct
               Links the (second, third, ...) conjuncts to the first
               conjunct. Used in association with 'dis'.
               Ex. Cassonetti dell'immondizia o
    ------------------------------------------------------------------------
    mod:       modifier
               Holds between a head and its modifier, whether clausal or
               non-clausal.
               Ex. I colori gli stessi
               Ex.
               Ex. urla non mi
    ------------------------------------------------------------------------
    mod_rel:   relative modifier
               Holds between the verbal head of a relative clause and its
               nominal head in the higher clause.
               Ex. che e' stato nel pomeriggio
               Used also to link the verbal head of a free relative to the
               "chi" pronoun (which in turn is directly linked to its
               governor)
               Ex. Non e' mai stato accertato la sua morte
    ------------------------------------------------------------------------
    modal:     modal verb
               Holds between a verbal head and a modal verb.
               Ex. Una sala ha essere
    ------------------------------------------------------------------------
    obl:       oblique
               Holds between a verbal head and a subcategorized
               non-direct, non-indirect and non-clausal complement.
               Ex. Si e' gas
    ------------------------------------------------------------------------
    ogg_d:     direct object
               Holds between a verbal head and its direct object (always
               non-clausal).
               Ex. la di gas
    ------------------------------------------------------------------------
    ogg_i:     indirect object
               Holds between a verbal or nominal head and the indirect
               object, i.e. the complement expressing the recipient or
               beneficiary of the action expressed by the verb or the noun.
               Ex. magistrato
    ------------------------------------------------------------------------
    pred:      predicate
               Holds between a head and a predicative complement, be it
               subject or object predicative.
               Ex. L'incontro e'
    ------------------------------------------------------------------------
    prep:      preposition
               Holds between a prepositional head and its complement,
               whether clausal or non-clausal.
               Ex. Un contributo
    ------------------------------------------------------------------------
    punc:      punctuation
               Holds between a word token and a punctuation mark.
               Ex. Teatro della ...
    ------------------------------------------------------------------------
    sogg:      subject
               Holds between a verb and its subject:
               1. superficial subjects in active or passive voice
                  Ex. Il ha subito
               2. clausal subjects
                  Ex. opportuno due parole
    ------------------------------------------------------------------------

Field 9: PHEAD
    Projective head of current token, which is always an underscore,
    because it is not available for the ISST-CoNLL Italian treebank.

Field 10: PDEPREL
    Dependency relation to projective head, which is always an underscore,
    because it is not available for the ISST-CoNLL Italian treebank.

2.2 ISST-CoNLL corpus composition

ISST-CoNLL is a subset of the balanced ISST corpus, corresponding to the
Corriere della Sera and periodicals partitions of ISST. It contains 79654
word tokens (of which 65016 are non-punctuation tokens), for a total of
4162 sentences.

-------------------------------
Statistics
-------------------------------
#sentences              4162
#tokens                79654
#non-punct tokens      65016
#coarse pos tags          14
#fine pos tags            28
#deprels                  21
-------------------------------

2.4 Conversion

Conversion from the ISST corpus consisted in:

a) combining information coming from two different annotation levels;
b) converting the ISST annotation scheme for dependency annotation into
   the CoNLL-2007 format.

Conversion had to cope with the fact that in ISST dependency relations are
expressed in terms of binary relations holding between two lexical heads
belonging to major lexical classes only (i.e. non-auxiliary verbs, nouns,
adjectives and adverbs). Information about grammatical words (e.g.
determiners, prepositions, auxiliaries) is instead encoded in ISST as
features associated with the participants in the relation. During the
conversion process the dependency relations involving grammatical words
had to be reconstructed from the original ISST annotation, and the already
existing dependency relations had to be revised accordingly. This was done
semi-automatically by means of several conversion scripts whose output was
manually revised with the help of a graphical annotation tool. Further
scripts were run to validate the consistency of the final output.
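
As an illustration only, the kind of consistency check mentioned above can
be sketched in Python. This is not one of the scripts actually used for
ISST-CoNLL: the names read_sentences and check_sentence are hypothetical,
and the sample sentence is made up for the example (the "gen=F|num=S"
notation of its FEATS values is an assumption, since the exact FEATS
syntax is not described here). The sketch only tests rules stated in
section 2.1: ten tab-separated fields, IDs restarting at 1, HEAD being
zero or a token ID, 'ROOT' when HEAD is zero, and underscores in PHEAD and
PDEPREL.

```python
FIELDS = ("ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
          "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL")

# Made-up sample sentence, NOT taken from the corpus; FEATS notation is assumed.
SAMPLE = "\n".join([
    "\t".join(["1", "Monica", "_", "S", "SP", "gen=F|num=S", "2", "sogg", "_", "_"]),
    "\t".join(["2", "mangia", "_", "V", "V", "num=S|per=3|mod=I|tmp=P", "0", "ROOT", "_", "_"]),
    "\t".join(["3", ".", "_", "PU", "PU", "_", "2", "punc", "_", "_"]),
    "",
])


def read_sentences(text):
    """Split CoNLL text into sentences; each token is a dict keyed by FIELDS."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():              # a blank line closes the sentence
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(
                f"expected {len(FIELDS)} fields, got {len(columns)}: {line!r}")
        current.append(dict(zip(FIELDS, columns)))
    if current:                           # input may lack a final blank line
        sentences.append(current)
    return sentences


def check_sentence(tokens):
    """Return a list of human-readable problems found in one sentence."""
    problems = []
    for position, tok in enumerate(tokens, start=1):
        if tok["ID"] != str(position):
            problems.append(f"token {position}: ID is {tok['ID']!r}")
        head = tok["HEAD"]
        if not head.isdigit() or int(head) > len(tokens):
            problems.append(f"token {position}: HEAD {head!r} is not 0 or a token ID")
        elif int(head) == 0 and tok["DEPREL"] != "ROOT":
            problems.append(f"token {position}: HEAD is 0 but DEPREL is {tok['DEPREL']!r}")
        if tok["PHEAD"] != "_" or tok["PDEPREL"] != "_":
            problems.append(f"token {position}: PHEAD/PDEPREL should be '_'")
    return problems


if __name__ == "__main__":
    # Print every problem found; a consistent file prints nothing.
    for number, sentence in enumerate(read_sentences(SAMPLE), start=1):
        for problem in check_sentence(sentence):
            print(f"sentence {number}: {problem}")
```

To check a real data file, the SAMPLE string would be replaced by the
contents of the file, read with UTF-8 encoding.
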
An XML intermediate format was produced in this process, preserving
original annotations that could not be accommodated in the CoNLL format.


3. Acknowledgements

Isidoro Barraco and Patrizia Topi did most of the work of writing
conversion scripts and revising tags. Kiril Ribarov, Alessandro Lenci and
Giuseppe Attardi contributed useful discussions on critical issues.


4. References

Simonetta Montemagni, Francesco Barsotti, Marco Battista, Nicoletta
Calzolari, Ornella Corazzari, Alessandro Lenci, Antonio Zampolli,
Francesca Fanciulli, Maria Massetani, Remo Raffaelli, Roberto Basili,
Maria Teresa Pazienza, Dario Saracino, Fabio Zanzotto, Nadia Mana, Fabio
Pianesi, Rodolfo Delmonte (2003a) "Building the Italian Syntactic-Semantic
Treebank", in Anne Abeillé (ed.), Building and using Parsed Corpora,
Language and Speech series, Kluwer, Dordrecht, pp. 189-210.

Simonetta Montemagni, Francesco Barsotti, Marco Battista, Nicoletta
Calzolari, Ornella Corazzari, Alessandro Lenci, Vito Pirrelli, Antonio
Zampolli, Francesca Fanciulli, Maria Massetani, Remo Raffaelli, Roberto
Basili, Maria Teresa Pazienza, Dario Saracino, Fabio Zanzotto, Nadia Mana,
Fabio Pianesi, Rodolfo Delmonte (2003b) "The syntactic-semantic treebank
of Italian. An overview", Linguistica Computazionale XVI-XVII,
pp. 461-492.