2001 Communicator Dialogue Act Tagged
Item Name: | 2001 Communicator Dialogue Act Tagged |
Author(s): | Rashmi Prasad, Marilyn Walker |
LDC Catalog No.: | LDC2004T16 |
ISBN: | 1-58563-306-2 |
ISLRN: | 137-996-514-791-4 |
DOI: | https://doi.org/10.35111/r53v-7r46 |
Release Date: | June 15, 2004 |
Member Year(s): | 2004 |
DCMI Type(s): | Text |
Data Source(s): | telephone conversations |
Project(s): | Communicator |
Application(s): | nominal expression generation, speech recognition, spoken dialogue modeling, spoken dialogue systems, summarization, tagging, topic detection and tracking |
Language(s): | English |
Language ID(s): | eng |
License(s): | LDC User Agreement for Non-Members |
Online Documentation: | LDC2004T16 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Prasad, Rashmi, and Marilyn Walker. 2001 Communicator Dialogue Act Tagged LDC2004T16. Web Download. Philadelphia: Linguistic Data Consortium, 2004. |
Introduction
2001 Communicator Dialogue Act Tagged was produced by the Linguistic Data Consortium (LDC) and contains approximately 1.15 million words of system and user interactions with entity and dialogue act tagging.
This corpus is an addendum to the 2001 Communicator Evaluation (LDC2003S01) corpus produced by LDC in 2003. This addendum contains annotations on the transcriptions of the system and user utterances as taken from the corrected log files of the 2001 Communicator Evaluation corpus. Corrections were done manually for missing or misaligned time-stamps on turn/utterance boundaries.
Dialogue act annotations are provided for system utterances in the dialogues. The dialogue act tags follow the DATE (Dialogue Act Tagging for Evaluation) scheme. In addition, both system and user utterances are tagged for named entities. For further description of the 2001 Communicator Evaluation corpus, please refer to its catalog entry (LDC2003S01).
Data
The complete Dialogue Act annotated corpus is available as a single XML text file totalling approximately 67 MB.
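Because the corpus ships as one large XML file, a streaming parser is one way to inspect it without holding the whole tree in memory. The sketch below counts element names with the standard library's `xml.etree.ElementTree.iterparse`. It makes no assumption about the corpus schema; the element names in `SAMPLE` are hypothetical stand-ins, not the real tag names.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature document; the real corpus schema may differ.
SAMPLE = b"<corpus><dialogue><turn/><turn/></dialogue></corpus>"


def count_elements(source):
    """Count element names in an XML source (path or binary file object)."""
    counts = Counter()
    # "end" events fire once an element has been read completely.
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        # Drop the element's text, attributes, and children to limit memory use.
        elem.clear()
    return counts


counts = count_elements(io.BytesIO(SAMPLE))
for name, n in counts.most_common():
    print(name, n)
```

Passing the path of the downloaded file in place of the `io.BytesIO` object would give a quick inventory of the element names actually used in the release.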
Here is the breakdown for dialogues and dialogue acts:
Dialogues | Dialogue Acts | Tagged Dialogue Acts | Unique Tags |
---|---|---|---|
1,683 | 85,881 | 82,277 | 68 |
Dialogue act tagging was done automatically by pattern matching against human-labeled utterances from the nine participating Communicator systems. Named entity tagging followed the same methodology.
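As a rough illustration of what pattern-matching tagging can look like, the sketch below assigns a label to a system utterance by trying a list of regular expressions in order. This is not the procedure used to build the corpus: the patterns, the label strings, and the `tag_utterance` helper are all hypothetical, and the labels are not taken from the DATE tag set.

```python
import re

# Hypothetical (pattern, label) pairs, for illustration only.
# They are NOT the corpus's actual rules or DATE tags.
PATTERNS = [
    (re.compile(r"\bwhat city\b.*\b(leaving|departing) from\b", re.I),
     "request_info:orig_city"),
    (re.compile(r"\bwelcome to\b", re.I),
     "present_info:welcome"),
    (re.compile(r"\b(sorry|pardon)\b.*\b(understand|repeat)\b", re.I),
     "apology:misunderstanding"),
]


def tag_utterance(text, patterns=PATTERNS):
    """Return the label of the first matching pattern, or None if none match."""
    for pattern, label in patterns:
        if pattern.search(text):
            return label
    return None


print(tag_utterance("What city are you leaving from?"))
print(tag_utterance("Welcome to the travel system."))
print(tag_utterance("Please hold."))
```

In this sketch the first matching pattern wins, so more specific patterns would need to be listed before more general ones, and an utterance that no pattern covers is left untagged.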
Each dialogue is segmented into system and user turns. Here is a breakdown of the distribution of turns, utterances, and words:
Unit | System | User | Total |
---|---|---|---|
Turns | 39,419 | 39,299 | 78,718 |
Utterances | 39,417 | 50,249 | 89,666 |
Words | 1,048,311 | 103,019 | 1,151,330 |
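The figures in this table can be cross-checked with a few lines of arithmetic. The sketch below copies the counts from the table, verifies that each system and user pair sums to the stated total, and derives the average number of words per turn for each speaker. The names `COUNTS`, `check_totals`, and `words_per_turn` are illustrative choices, not part of the corpus documentation.

```python
# Counts copied from the table above.
COUNTS = {
    "turns": {"system": 39_419, "user": 39_299, "total": 78_718},
    "utterances": {"system": 39_417, "user": 50_249, "total": 89_666},
    "words": {"system": 1_048_311, "user": 103_019, "total": 1_151_330},
}


def check_totals(counts):
    """Return True if system + user equals the stated total in every row."""
    return all(
        row["system"] + row["user"] == row["total"] for row in counts.values()
    )


def words_per_turn(counts, speaker):
    """Average number of words per turn for 'system' or 'user'."""
    return counts["words"][speaker] / counts["turns"][speaker]


print("totals consistent:", check_totals(COUNTS))
for speaker in ("system", "user"):
    print(speaker, round(words_per_turn(COUNTS, speaker), 2))
```

The totals are consistent, and the averages come to roughly 26.6 words per system turn against roughly 2.6 words per user turn, so system prompts account for most of the 1.15 million words.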
Samples
For an example of the data in this corpus, please view this sample (XML).
Sponsorship
This research was funded by DARPA under contract MDA972-99-3-0003.
Updates
None at this time.