Columbia Games Corpus

Item Name:	Columbia Games Corpus
Author(s):	Julia Hirschberg, Agustin Gravano, Stefan Benus, Gregory Ward, Elisa Sneed German
LDC Catalog No.:	LDC2021S02
ISBN:	1-58563-960-5
ISLRN:	834-843-130-497-9
DOI:	https://doi.org/10.35111/ayn3-sp31
Release Date:	March 15, 2021
Member Year(s):	2021
DCMI Type(s):	Sound, Text
Sample Type:	pcm
Sample Rate:	16000
Data Source(s):	microphone conversation
Application(s):	discourse analysis, prosody
Language(s):	English
Language ID(s):	eng
License(s):	Columbia Games Corpus Agreement
Online Documentation:	LDC2021S02 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Hirschberg, Julia, et al. Columbia Games Corpus LDC2021S02. Web Download. Philadelphia: Linguistic Data Consortium, 2021.
Related Works:	relatesTo LDC2023S07, LDC Spoken Language Sampler - Sixth Release

Introduction

Columbia Games Corpus was developed by the Spoken Language Group, Columbia University and the Department of Linguistics, Northwestern University. It consists of approximately 10 hours of spontaneous English conversation along with corresponding orthographic transcripts and annotation. Each recording captures two subjects playing a series of computer games that required verbal communication to achieve joint goals: identifying and moving images on the screen to reach a combined number of points.

Each player used a separate laptop computer and could not see the screen of the other player. Participants played two games: the Cards Game and the Objects Game. In the Cards Game, one participant described a card and, depending on the task in the game, the second participant either searched for the described card or tried to match it against the cards shown on their screen. In the Objects Game, each player's screen displayed 5-7 objects, one of which was the target object. One player described the target object's location on their screen, and the other player tried to move that object to the same position on their own screen.

Data

Over 12 sessions conducted in 2004, 13 subjects (six female, seven male) participated in the collection. Sessions contained an average of 45 minutes of dialogue.

Each recording has corresponding manually time-aligned orthographic transcripts, discourse annotation of affirmative cue words, and turn-taking annotation. Annotation guidelines are included in this release. Task files for each game are also included for each recording.

Audio data was recorded at a sample rate of 48 kHz with 16-bit precision, and later converted to 16 kHz, single-channel, FLAC-compressed WAV. All text data is encoded in UTF-8.
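As a quick sanity check on the audio format described above, the following minimal Python sketch reads the header of a PCM WAV file and compares it with the documented sample rate and channel count. It assumes the FLAC-compressed audio has already been decoded to plain PCM WAV by an external tool. The function names `describe_wav` and `matches_documented_format` are hypothetical helpers written for this illustration and are not part of the release.

```python
import wave

# Values taken from the corpus description above.
EXPECTED_RATE_HZ = 16000   # 16 kHz after conversion
EXPECTED_CHANNELS = 1      # single channel


def describe_wav(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file.

    Assumption: the file is uncompressed PCM WAV, i.e. the FLAC compression
    has already been undone by some external decoder.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def matches_documented_format(path):
    """True if the file has the documented sample rate and channel count."""
    rate, channels, _ = describe_wav(path)
    return rate == EXPECTED_RATE_HZ and channels == EXPECTED_CHANNELS
```

Running `matches_documented_format` over every decoded file is one way to confirm that a local copy has the expected format before feeding it to prosody or turn-taking analysis tools.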
