This is the README file for the Basque treebank (TB) in CoNLL 2007.
Version: README, v 0.1 2007/01/26

1. Source

The 3LB Treebank (Basque, Catalan and Spanish).
See http://www.dlsi.ua.es/projectes/3lb/index_en.html

Description: The 3LB project selected 3,700 sentences totaling 56,000 words,
taken from both literary texts (a balanced 20th-century corpus) and
newspaper texts (1999-2000).

2. CoNLL specifics

- The data format adheres to the standard format used for all languages.
  See http://depparse.uvt.nl/depparse-wiki/DataFormat

- For each word, the data contains:
  * position in the sentence, starting from 1
  * word form
  * lemma
  * coarse-grained part of speech
  * fine-grained part of speech
  * a set of morphological features, including number, case, tense, aspect, ...
  * head of the token
  * dependency relation

- The coarse- and fine-grained parts of speech are described in:
  * Aduriz I., Agirre E., Aldezabal I., Alegria I., Arregi X., Arriola J.,
    Artola Zubillaga X., Gojenola K., Sarasola K. 2000
    A Word-grammar based morphological analyzer for agglutinative languages
    Proceedings of the International Conference on Computational Linguistics.
    COLING 2000 Saarbrucken, Germany. August, 2000
    http://ixa.si.ehu.es/Ixa/Argitalpenak/Artikuluak/1014385332/publikoak/coling.pdf

  There are 16 different main parts of speech (coarse POS tags). Because a
  word form can contain internal noun ellipsis, this set is extended with
  compound tags, for example:

    IZE(noun) --> IZE_IZEELI  which means "noun with internal noun ellipsis"
    ADJ       --> ADJ_IZEELI  which means "adjective with internal noun ellipsis"
    ADT(verb) --> ADT_IZEELI  which means "verb with internal noun ellipsis"

  These words share the properties of both the first POS and the final noun.
  Although the number of elliptical elements is not bounded in theory, in
  practice it is limited to one or two ellipses at most:
  IZE_IZEELI_IZEELI (two elliptical nouns).

  Each coarse POS can have three or four fine-grained POS tags.
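The per-word fields and the compound coarse POS tags described above can be read with a short script. What follows is only a minimal sketch, not part of the distribution. It assumes the usual CoNLL layout: one token per line, tab-separated columns in the order listed above, features separated by "|", and "_" for an empty feature set. The names Token, parse_token_line and split_coarse_pos are hypothetical, and the sample line is invented for illustration rather than taken from the treebank.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Suffix this README uses to mark one internal noun ellipsis.
ELLIPSIS_SUFFIX = "_IZEELI"


@dataclass
class Token:
    """One word of a sentence (hypothetical container for this sketch)."""
    position: int      # position in the sentence, starting from 1
    form: str          # word form
    lemma: str
    cpos: str          # coarse-grained part of speech
    pos: str           # fine-grained part of speech
    feats: List[str]   # morphological features
    head: int          # position of the head token
    deprel: str        # dependency relation


def parse_token_line(line: str) -> Token:
    """Parse one tab-separated token line (assumed column order, see above)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError(f"expected at least 8 columns, got {len(cols)}")
    # Assumption: "_" marks an empty feature set, "|" separates features.
    feats = [] if cols[5] == "_" else cols[5].split("|")
    # Assumption: the head column is numeric (true for annotated data).
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                 feats, int(cols[6]), cols[7])


def split_coarse_pos(cpos: str) -> Tuple[str, int]:
    """Split a coarse POS tag into its base tag and its number of ellipses."""
    count = 0
    while cpos.endswith(ELLIPSIS_SUFFIX):
        cpos = cpos[:-len(ELLIPSIS_SUFFIX)]
        count += 1
    return cpos, count


if __name__ == "__main__":
    # Invented sample line; every value here is illustrative only.
    sample = "1\tetxekoa\tetxe\tIZE_IZEELI\tARR\tKAS:ABS|NUM:S\t2\tncsubj"
    token = parse_token_line(sample)
    print(token.form, split_coarse_pos(token.cpos))
```

Stripping the suffix from the end of the tag, rather than splitting on every underscore, keeps the base tag intact even if a base tag itself were to contain an underscore.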
- The head and dependency relation fields were annotated using the model
  described in:
  - The most detailed public guide, which is written in Spanish:
    http://ixa.si.ehu.es/Ixa/Argitalpenak/Barne_txostenak/1068549887/publikoak/guia.pdf
  - These papers, which give a less detailed general overview of the
    dependency annotation:
    * Aduriz I., Aranzabe M., Arriola J., Atutxa A., Diaz de Ilarraza A.,
      Garmendia A., Oronoz M. 2003
      Construction of a Basque Dependency Treebank
      TLT 2003. Second Workshop on Treebanks and Linguistic Theories,
      Vaxjo, Sweden, November 14-15.
      http://ixa.si.ehu.es/Ixa/Argitalpenak/Artikuluak/1069442222/publikoak/TLT2003_pdf
    * Aranzabe M., Arriola J.M., Diaz de Ilarraza A. 2004
      Towards a Dependency Parser of Basque
      Proceedings of the Coling 2004 Workshop on Recent Advances in
      Dependency Grammar. Geneva, Switzerland.
      http://ixa.si.ehu.es/Ixa/Argitalpenak/Artikuluak/1097683503/publikoak/camara-ready-basqueDP

Contact: koldo.gojenola@ehu.es

3. Acknowledgement

Coordinators: Arantxa Diaz de Ilarraza, Koldo Gojenola Galletebeitia