DEFT Spanish Treebank

CatalogID: LDC2018T01
Release date: January 5, 2016
Authors: Mariona Taulé, M. Antonia Martí, Ann Bies, Aina Garí, Montserrat Nofre, Zhiyi Song, Stephanie Strassel, Joe Ellis

1. Introduction

This package is the complete cumulative release of DEFT Spanish Treebank annotation. It contains treebank annotation of International Spanish Newswire data and of Latin American Spanish Discussion Forum data, annotated according to AnCora Spanish Treebank guidelines for the DEFT project. The AnCora guidelines have been adapted to account for web text phenomena in the discussion forum data, following similar adaptations made to the English Treebank annotation guidelines at LDC. This release contains 109,701 annotated tokens.

DARPA's Deep Exploration and Filtering of Text (DEFT) program aims to improve state-of-the-art capabilities in automated deep natural language processing, with a particular focus on technologies dealing with inference, causal relationships, and anomaly detection across several languages. The Spanish Treebank annotation in this package supports DEFT's goal of deep natural language understanding.

This package includes:

1. 114 files of international Spanish newswire (NW) data, fully annotated with both constituents and syntactic functions, corresponding to 54,394 tokens. This is the complete NW set of Spanish Treebank data for DEFT, and was previously released to the DEFT community as LDC2014E130, DEFT International Spanish NW Treebank V2.0 (and as the NW portion of LDC2015E66, DEFT Spanish Treebank V1).

2. 60 files of Latin American Spanish discussion forum (DF) data, fully annotated with both constituents and syntactic functions, corresponding to 55,307 tokens. This is the complete DF set of Spanish Treebank data for DEFT, the first increment of which was previously released to the DEFT community as the DF portion of LDC2015E66, DEFT Spanish Treebank V1.

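As a quick consistency check on the figures above, the per-genre counts can be summed and compared with the stated release total. This is only a sketch: the dictionary name GENRES and the variable names are illustrative and do not correspond to anything shipped in the package.

```python
# Figures copied from the package description above: (files, tokens) per genre.
GENRES = {
    "NW": (114, 54_394),  # international Spanish newswire
    "DF": (60, 55_307),   # Latin American Spanish discussion forum
}

total_files = sum(files for files, _ in GENRES.values())
total_tokens = sum(tokens for _, tokens in GENRES.values())

print(total_files, total_tokens)  # prints "174 109701"
```

The token total matches the 109,701 annotated tokens stated for this release.
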
Newswire source files were selected from Spanish Gigaword, previously released in LDC2011T12. They were chosen to be in the international Spanish newswire genre and were manually sentence (SU) segmented for DEFT. Discussion forum source files were selected from Spanish DF source data collected by LDC and previously released in LDC2014E14. They were chosen to be Latin American Spanish in continuous multi-posts (CMPs) of 100-1000 words.

2. Contents

./README

  This file

docs/

  A listing of all of the files in this release can be found in docs/file.tbl.

  A listing of the base data filenames can be found in docs/file.ids.

  Supplemental treebank annotation guidelines for the data can be found in the following files in the docs/ directory: LAS-DF_Guidelines_2015.pdf, Addenda_to_guidelines-en.pdf, List_of_Emoticons.pdf, Multiwords_2015.pdf.

  A paper about the development of the DF portion of this treebank can be found here: docs/Paper_NLPIT-2015-LADF.pdf

  Spanish Treebank Annotation of Informal Non-Standard Web Text. 2015. Mariona Taulé, M Antonia Martí, Ann Bies, Aina Garí, Montserrat Nofre, Zhiyi Song, Stephanie Strassel and Joe Ellis. 1st International Workshop on Natural Language Processing for Informal Text (NLPIT 2015), at the International Conference on Web Engineering (ICWE 2015).

data/

  data/su_annotated_source/ contains directories for DF (discussion forum) and NW (newswire) source documents that have been manually sentence (SU) segmented. The files are plain text files (.txt) with one sentence unit per line.

    data/su_annotated_source/DF/.txt
    data/su_annotated_source/NW/.txt

  data/ancora_treebank/ contains the manually treebank annotated files in directories for DF (discussion forum) and NW (newswire) genres. The files are .xml files.

    data/ancora_treebank/DF/-iso.tbf.xml
    data/ancora_treebank/NW/-iso.tbf.xml

3. Notes on sentence segmentation and part-of-speech (POS) tags

The data was automatically sentence segmented, and the resulting sentence segments (SUs) were manually corrected at LDC.

Treebank annotation treats each sentence segment, or sentence unit (SU), as a separate source sentence and annotates it accordingly. SUs are not combined. If it is necessary to annotate more than one sentence syntactically within an SU, such sentences are included under a top node that contains the entire original SU.

Spellings, etc. are as in the source and are not corrected.

As part of the treebank annotation, every token should have a POS tag. Note that emoticons and symbols (such as the copyright symbol) have POS=word, which means that they do not have a morphosyntactic category assigned. Also note that when a source text sentence does not include final punctuation, the annotation adds an elliptic node with the labels name=fp, elliptic=yes and anomaly=yes, without a POS tag.

4. Notes on treebank annotation

This data was manually annotated according to AnCora Spanish Treebank morphological and syntactic annotation guidelines, with both constituents and syntactic functions. For the DF data, the AnCora Spanish Treebank guidelines have been adapted to account for web text phenomena, following similar adaptations made to the English Treebank annotation guidelines at LDC.

The original AnCora annotation guidelines can be found at the following locations:

Civit, M. (2003) Criterios de etiquetación y desambiguación morfosintáctica de corpus en español. En: Sociedad Española para el Procesamiento del Lenguaje Natural. Colección de Monografías, num. 3
(http://www.sepln.org/wp-content/uploads/2011/02/monografiaCivit.pdf)

Soriano, B., O. Borrega, M. Taulé and M.A. Martí (2008) Guidelines, 3LB-WP-02-03, Universitat de Barcelona.
(http://clic.ub.edu/corpus/webfm_send/17)

Supplemental treebank annotation guidelines for the data can be found in the following files in the docs/ directory of this package: LAS-DF_Guidelines_2015.pdf, Addenda_to_guidelines-en.pdf, List_of_Emoticons.pdf, Multiwords_2015.pdf.

A paper about the development of the DF portion of this treebank can be found here: docs/Paper_NLPIT-2015-LADF.pdf

5. Note on overlap with Spanish Entities, Relations, and Events (ERE) annotation

The source data that is treebanked in this package largely overlaps with the source Spanish data that is currently being annotated at LDC for Light and Rich ERE in DEFT.

6. Contact Information

Ann Bies

-------------------
README Update Log
  Created: Ann Bies, May 19, 2015
  Update: Ann Bies, August 11, 2015
  Update: Ann Bies, December 9, 2015
  Update: Ann Bies, January 5, 2016
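
As an illustration of the notes in section 3, the following Python sketch counts the elliptic nodes that mark missing final punctuation. It assumes that the labels name, elliptic and anomaly are stored as XML attributes on a node, which this README does not state explicitly. The function names, the sample string and its element names are hypothetical; only the data/ancora_treebank/ directory layout comes from section 2.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def count_missing_final_punct(root):
    """Count nodes labeled name=fp, elliptic=yes, anomaly=yes under root.

    Assumption: the three labels are XML attributes of the same element.
    """
    return sum(
        1
        for node in root.iter()
        if node.get("name") == "fp"
        and node.get("elliptic") == "yes"
        and node.get("anomaly") == "yes"
    )


def count_in_release(release_dir):
    """Sum the counts over every .xml file under data/ancora_treebank/."""
    treebank_dir = Path(release_dir, "data", "ancora_treebank")
    total = 0
    for path in sorted(treebank_dir.glob("*/*.xml")):
        total += count_missing_final_punct(ET.parse(path).getroot())
    return total


# Hypothetical sample, not taken from the corpus; element names are made up.
SAMPLE = (
    "<article><sentence>"
    '<n wd="hola"/>'
    '<f name="fp" elliptic="yes" anomaly="yes"/>'
    "</sentence></article>"
)
sample_count = count_missing_final_punct(ET.fromstring(SAMPLE))
```

Calling count_in_release with the path to the unpacked package would give the number of SUs whose source text lacked final punctuation, provided the attribute assumption holds for the released files.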