ESPADA (Extended Syntactic Phrase Alignment DAtaset)

Contact:
Yuki Arase (Osaka University) arase@ist.osaka-u.ac.jp
Junichi Tsujii (Artificial Intelligence Research Center (AIRC), AIST) j-tsujii@aist.go.jp

We extended the syntactic phrase alignment dataset for evaluation (SPADE) into ESPADA so that it can be used for training phrase alignment models. The annotation standards follow the ones designed for SPADE.
(SPADE is also available at LDC: https://catalog.ldc.upenn.edu/LDC2018T09)

ESPADA provides gold-standard HPSG trees annotated by a linguist and gold-standard phrase alignments identified by three annotators who are native or near-native English speakers. As a result, 251,972 phrase alignments were identified in 1,916 sentential paraphrases. Among them, 80,572 alignments were agreed on by at least two annotators, and 66,246 alignments were agreed on by all annotators.

* Recommended/Expected use of corpus
ESPADA can be used for training and testing phrasal paraphrase detection and phrase representation models. It can also be used to analyse paraphrase phenomena, for example what kinds of linguistic operations occur in phrasal paraphrases.

* Data Format and specification
This dataset consists of the following files/directories:

root/
 ├ data/
 │  ├ espada.xsd : XML schema for ESPADA dataset
 │  └ xml/ : Contains XML files of annotated sentences (3,832 files)
 │     ├ s-0003.xml
 │     ├ t-0003.xml
 │     ├ ...
 │     └ t-3125.xml
 └ docs/
    ├ readme.txt : This file
    └ guideline.pdf : Annotation guideline used for phrase alignment annotation (same as the guideline for SPADE)

Each XML file presents an HPSG tree in the same format as the output of the Enju parser (https://mynlp.is.s.u-tokyo.ac.jp/enju/). Sentential paraphrase pairs can be identified by sentence ids (and likewise by file names); "s0000" and "t0000" are a paraphrase pair.
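
The pairing rule above can be sketched in a few lines of Python. This is only an illustration, not part of the dataset tooling: the function name pair_paraphrase_files is hypothetical, and the file name pattern "s-NNNN.xml" / "t-NNNN.xml" is assumed from the directory listing.

```python
import re
from pathlib import Path

# Assumption: files are named "s-NNNN.xml" and "t-NNNN.xml",
# as suggested by the directory listing in this readme.
FILE_NAME = re.compile(r"^([st])-(\d+)\.xml$")


def pair_paraphrase_files(xml_dir):
    """Return {number: (s_path, t_path)} for every number that has both files."""
    found = {}
    for path in sorted(Path(xml_dir).glob("*.xml")):
        match = FILE_NAME.match(path.name)
        if match is None:
            continue  # skip files that do not follow the naming pattern
        side, number = match.groups()
        found.setdefault(number, {})[side] = path
    # Keep only the numbers for which both the "s" and the "t" file exist.
    return {
        number: (sides["s"], sides["t"])
        for number, sides in found.items()
        if "s" in sides and "t" in sides
    }
```

Calling pair_paraphrase_files("data/xml") from the dataset root should return one entry per paraphrase pair. With the 3,832 files listed above, that would be 1,916 entries, provided every file has a counterpart.
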
We added phrase alignment annotations to constituency nodes ("cons" nodes) using the "pa1", "pa2", and "pa3" attributes, which correspond to the alignments made by annotator #1, annotator #2, and annotator #3, respectively. Each attribute takes the id of a constituency node in the other sentence of the pair that the annotator aligned with this node. When a phrase is judged NOT to have any paraphrase, the attribute takes "-1" to represent such a null alignment.

The XML schema used to represent HPSG trees is described in detail at
https://mynlp.is.s.u-tokyo.ac.jp/enju/enju-manual/enju-output-spec.html
The only difference from that specification is that "cons" nodes now have the extra attributes "pa1", "pa2", and "pa3".

Tag      | attributes
sentence | id, parse_status, fom
cons     | id, cat, xcat, schema, head, sem_head, pa1, pa2, pa3
tok      | id, cat, pos, base, tense, aspect, voice, aux, type, lexentry, pred, arg1, arg2, arg3, arg4, mod

Example:
<-- This sentence is a paraphrase of the sentence "t0000"
...
<-- This constituency node "c10" was aligned to "c9" and "c14" by annotator #1 and annotator #2, respectively, while annotator #3 judged that it does not have any correspondence (paraphrase) in the sentence "t0000".
...

* Citation
When you publish a research paper using ESPADA, please cite the following:

Yuki Arase and Junichi Tsujii. 2020. Compositional Phrase Alignment and Beyond. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

# The complete bibliography information will be available at ACL Anthology.

@inproceedings{arase_tsujii:emnlp2020,
  author = {Arase, Yuki and Tsujii, Junichi},
  title = {Compositional Phrase Alignment and Beyond},
  booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year = {2020},
  month = nov
}
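
* Reading the alignment attributes (unofficial sketch)
As a supplement to the data format section, the following Python sketch shows one way to read the "pa1", "pa2", and "pa3" attributes with the standard library. Everything in it is an assumption made for illustration: the XML fragment is hypothetical (only the "c10" values follow the example in this readme, the other values are invented), the function names are not part of any released tool, and the attribute values are treated as opaque strings. The majority rule is one possible reading of "agreed on by at least two annotators" and may differ from how the statistics in this readme were computed.

```python
import xml.etree.ElementTree as ET
from collections import Counter

ANNOTATOR_ATTRS = ("pa1", "pa2", "pa3")
NULL_ALIGNMENT = "-1"  # value that marks "no paraphrase" in this readme

# Hypothetical fragment; real files carry the full Enju attribute set.
SAMPLE = """
<sentence id="s0000" parse_status="success">
  <cons id="c10" cat="NP" pa1="c9" pa2="c14" pa3="-1">
    <tok id="t5" cat="N" pos="NN" base="example">example</tok>
  </cons>
  <cons id="c11" cat="VP" pa1="c3" pa2="c3" pa3="c3">
    <tok id="t6" cat="V" pos="VB" base="run">run</tok>
  </cons>
</sentence>
"""


def read_alignments(element):
    """Map each annotated cons id to its (pa1, pa2, pa3) attribute values."""
    alignments = {}
    for cons in element.iter("cons"):
        values = tuple(cons.get(attr) for attr in ANNOTATOR_ATTRS)
        if any(value is not None for value in values):
            alignments[cons.get("id")] = values
    return alignments


def majority_target(values):
    """Return the value chosen by at least two annotators, or None."""
    counts = Counter(value for value in values if value is not None)
    if not counts:
        return None
    value, count = counts.most_common(1)[0]
    return value if count >= 2 else None


root = ET.fromstring(SAMPLE.strip())
for cons_id, values in read_alignments(root).items():
    agreed = majority_target(values)
    if agreed is None:
        label = "no majority"
    elif agreed == NULL_ALIGNMENT:
        label = "null alignment"
    else:
        label = agreed
    print(cons_id, values, label)
```

For a real file, ET.parse(path).getroot() can be passed to read_alignments in place of the parsed sample, since the function searches for "cons" nodes at any depth.
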