README:

These data have been used to run the textual entailment experiments described in:

Sara Tonelli and Elena Cabrio, "Hunting for Entailing Pairs in the Penn Discourse Treebank", in Proceedings of COLING 2012, Mumbai, India.

The files contain Text-Hypothesis pairs in the standard RTE XML format (for more details, see http://www.nist.gov/tac/2011/RTE/), which have been manually annotated as entailing or not entailing. All sentence pairs have been extracted from the Penn Discourse Treebank and are therefore connected by a discourse relation. The label of this relation is the value assigned to the "task" attribute in the XML files.
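
The following Python sketch shows one way to read such pairs. It assumes the usual RTE layout (a "pair" element with "t" and "h" children) and assumes that the annotation is stored in an attribute named "entailment" with values such as YES/NO. Only the "task" attribute is documented here, so please check the other names against the actual files. The sample document and the function name read_pairs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the usual RTE layout; it is NOT copied from the
# released files. The "entailment" attribute name and its YES/NO values are
# assumptions; only the "task" attribute is documented in this README.
SAMPLE = """<entailment-corpus>
  <pair id="1" entailment="YES" task="Restatement">
    <t>The firm cut costs. It closed two plants.</t>
    <h>The firm closed two plants.</h>
  </pair>
  <pair id="2" entailment="NO" task="Restatement">
    <t>Sales rose in May.</t>
    <h>Sales fell in May.</h>
  </pair>
</entailment-corpus>"""


def read_pairs(root):
    """Return one dict per <pair> element found under root."""
    pairs = []
    for pair in root.iter("pair"):
        pairs.append({
            "id": pair.get("id"),
            "task": pair.get("task"),            # discourse relation label
            "label": pair.get("entailment"),     # assumed attribute name
            "text": (pair.findtext("t") or "").strip(),
            "hypothesis": (pair.findtext("h") or "").strip(),
        })
    return pairs


pairs = read_pairs(ET.fromstring(SAMPLE))
for p in pairs:
    print(p["id"], p["task"], p["label"], p["hypothesis"])
```

To read one of the released files instead of the sample, pass the root of the parsed file, e.g. read_pairs(ET.parse("train.xml").getroot()).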

Description of the files:

- 'train.xml' and 'test.xml': training and test sets used for running the classification experiment described in Section 4.3.2 of the above paper.
'train.xml' contains 160 pairs, balanced between positive and negative pairs, while 'test.xml' contains 100 pairs. All pairs are connected through some kind of "Restatement" relation (see the Penn Discourse Treebank manual), specified by the value of the "task" attribute.

- 'train_anaphoraresolved.xml' and 'test_anaphoraresolved.xml': the same files described above, in which anaphoric expressions have been manually resolved.
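
The pair counts and the label balance given above can be checked with a short script. This is a minimal Python sketch under the same caveats as any reader for these files: the "pair" element name follows the usual RTE layout, and the label attribute name ("entailment") is an assumption that may need to be changed. The function name summarize and the demo document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize(root, label_attr="entailment"):
    """Count pairs overall, per label, and per "task" value."""
    pairs = list(root.iter("pair"))
    labels = Counter(p.get(label_attr) for p in pairs)  # assumed attribute
    tasks = Counter(p.get("task") for p in pairs)       # documented attribute
    return len(pairs), labels, tasks


# Tiny hypothetical document so the sketch runs without the data files.
demo = ET.fromstring(
    '<entailment-corpus>'
    '<pair id="1" entailment="YES" task="Restatement"><t>a</t><h>b</h></pair>'
    '<pair id="2" entailment="NO" task="Restatement"><t>c</t><h>d</h></pair>'
    '</entailment-corpus>'
)
total, labels, tasks = summarize(demo)
print(total, dict(labels), dict(tasks))
```

Running summarize(ET.parse("train.xml").getroot()) on the released training file should report 160 pairs split evenly between the two labels, provided the assumed attribute name matches the files.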


For further questions and remarks, please contact:
satonelli [at] fbk.eu
elena.cabrio [at] inria.fr