SPADE (Syntactic Phrase Alignment Dataset for Evaluation)

Contact:
Yuki Arase (Osaka University) arase@ist.osaka-u.ac.jp
Junichi Tsujii (Artificial Intelligence Research Center (AIRC), AIST) j-tsujii@aist.go.jp

We created SPADE (Syntactic Phrase Alignment Dataset for Evaluation) to evaluate syntactic phrase alignment in paraphrasal sentences. SPADE provides gold HPSG trees annotated by a linguistic expert and gold phrase alignments identified by three annotators. In total, 20,276 phrases are extracted from 201 sentential paraphrase pairs, on which 15,721 alignments are obtained that at least one annotator regarded as paraphrases. These pairs are separated into development (50 pairs) and test (151 pairs) sets. For more details, please refer to our papers below.

This dataset consists of the following files/directories:

root/
├ readme.txt                : This file
├ arase_tsujii_lrec2018.pdf : Our paper on the SPADE dataset
├ guideline.pdf             : Annotation guideline used for phrase alignment annotation
├ dev.txt                   : Sentential paraphrase pairs (tab-separated) in the development set; the line number corresponds to the sentence index
├ test.txt                  : Sentential paraphrase pairs (tab-separated) in the test set; the line number corresponds to the sentence index
├ dev/                      : Contains XML files of the development set (100 files)
│  ├ s-001.xml
│  ├ t-001.xml
│  ├ ...
│  └ t-050.xml
└ test/                     : Contains XML files of the test set (302 files)
   ├ s-001.xml
   ├ t-001.xml
   ├ ...
   └ t-151.xml

Each XML file presents an HPSG tree in the same format as the output of the Enju parser (http://www.nactem.ac.uk/enju/). Sentential paraphrase pairs can be identified by sentence ids (and likewise by file names); "s1" and "t1" are a paraphrase pair. We added phrase alignment annotation to constituency nodes ("cons" nodes) using the "pa1", "pa2", and "pa3" attributes, which correspond to alignments made by annotator #1, annotator #2, and annotator #3, respectively.
Each attribute takes the id of the constituency node in the other sentence to which the annotator aligned the phrase. When a phrase is judged NOT to have any paraphrasal phrase, the attribute takes "-1" to represent such a null alignment.

The XML schema used to represent an HPSG tree is described in detail at http://www.nactem.ac.uk/enju/enju-manual/enju-output-spec.html. The only difference from that specification is that "cons" nodes now have the extra attributes "pa1", "pa2", and "pa3".

Tag      | Attributes
sentence | id, parse_status, fom
cons     | id, cat, xcat, schema, head, sem_head, pa1, pa2, pa3
tok      | id, cat, pos, base, tense, aspect, voice, aux, type, lexentry, pred, arg1, arg2, arg3, arg4, mod

Example:
<-- This sentence is a paraphrase of a sentence "t1"
...
<-- This constituency node "c10" was aligned to "c9" and "c14" by annotator #1 and annotator #2, respectively, while annotator #3 judged that it does not have any correspondence (paraphrase) in the sentence "t1."
...

When you publish a research paper using the SPADE dataset, please add references to the following two papers:

Yuki Arase and Junichi Tsujii. 2017. Monolingual Phrase Alignment on Parse Forests. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1-11, Copenhagen, Denmark. http://aclweb.org/anthology/D17-1001

Yuki Arase and Junichi Tsujii. 2018. SPADE: Evaluation Dataset for Monolingual Phrase Alignment. In Proceedings of the Language Resources and Evaluation Conference (LREC), Miyazaki, Japan.

@inproceedings{arase_tsujii:emnlp2017,
  author    = {Arase, Yuki and Tsujii, Junichi},
  title     = {Monolingual Phrase Alignment on Parse Forests},
  booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  pages     = {1--11},
  url       = {http://aclweb.org/anthology/D17-1001},
  year      = {2017},
  month     = {September},
  address   = {Copenhagen, Denmark},
}

@inproceedings{arase_tsujii:lrec2018,
  author    = {Arase, Yuki and Tsujii, Junichi},
  title     = {{SPADE}: Evaluation Dataset for Monolingual Phrase Alignment},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference (LREC)},
  year      = {2018},
  month     = {May},
  address   = {Miyazaki, Japan},
}
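
Reading the alignment attributes (illustrative sketch):

The following Python snippet is a minimal sketch, not part of the dataset, of how the "pa1", "pa2", and "pa3" attributes on "cons" nodes might be read with the standard library. The function name collect_alignments and the SAMPLE string are hypothetical and made up for illustration; real SPADE files carry more attributes and may have additional wrapping elements. The sketch also assumes each attribute holds a single node id or "-1".

```python
import xml.etree.ElementTree as ET

ANNOTATOR_ATTRS = ("pa1", "pa2", "pa3")
NULL_ALIGNMENT = "-1"  # value the readme describes for a null alignment

# Hypothetical sample, invented for illustration only (not taken from SPADE).
SAMPLE = """<sentence id="s1" parse_status="success">
  <cons id="c0" cat="S" pa1="c0" pa2="c0" pa3="c0">
    <cons id="c10" cat="NP" pa1="c9" pa2="c14" pa3="-1">
      <tok id="t0" cat="N" pos="NN" base="example">example</tok>
    </cons>
  </cons>
</sentence>"""


def collect_alignments(xml_text):
    """Map each cons id to its per-annotator aligned node id.

    A null alignment ("-1") is returned as None. As a simplification,
    a missing attribute is also returned as None.
    """
    root = ET.fromstring(xml_text)
    alignments = {}
    # iter() walks the whole tree, so nested cons nodes are included.
    for cons in root.iter("cons"):
        per_annotator = {}
        for attr in ANNOTATOR_ATTRS:
            value = cons.get(attr)
            if value is None or value == NULL_ALIGNMENT:
                per_annotator[attr] = None
            else:
                per_annotator[attr] = value
        alignments[cons.get("id")] = per_annotator
    return alignments


result = collect_alignments(SAMPLE)
print(result["c10"])
```

To process a file from dev/ or test/, the same function could be given the file contents read as text. The returned ids refer to "cons" nodes in the paired file (for example, ids found in s-001.xml refer to nodes in t-001.xml).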