CALLFRIEND Farsi Second Edition Transcripts


Item Name: CALLFRIEND Farsi Second Edition Transcripts
Authors: Alexandra Canavan, George Zipperlen, David Graff
LDC Catalog No.: LDC2014T01
ISBN: 1-58563-667-3
Release Date: Jan 15, 2014
Data Type: text
Data Source(s): telephone conversations
Project(s): LID
Application(s): language identification
Language(s): Iranian Persian
Language ID(s): pes
Distribution: Web Download
Member fee: $0 for 2014 members
Non-member Fee: US $1000.00
Reduced-License Fee: N/A
Extra-Copy Fee: US $
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Alexandra Canavan, George Zipperlen, David Graff
2014
CALLFRIEND Farsi Second Edition Transcripts
Linguistic Data Consortium, Philadelphia

Introduction

CALLFRIEND Farsi Second Edition Transcripts was developed by the Linguistic Data Consortium (LDC) and consists of transcripts for approximately 42 hours of telephone conversation (100 recordings) among native Farsi speakers. The calls were recorded in 1995 and 1996 as part of the CALLFRIEND collection, a project designed primarily to support research in automatic language identification. One hundred native Farsi speakers living in the continental United States each made a single telephone call, lasting up to 30 minutes, to a family member or friend living in the United States.

Corresponding speech data is available as CALLFRIEND Farsi Second Edition Speech (LDC2014S01).

Data

Transcripts are presented in three formats: romanized transcripts (*asc.txt), Arabic-script transcripts (*ntv.txt), and a simple XML format (*.xml) that contains both the romanized and the Arabic-script forms of the Farsi words. In the *.txt files, each file begins with a single comment line containing the file_id string. This is followed immediately by the list of time-stamped segments, one per line, in order of their start-offset values, with no blank lines. The four main fields on each segment line (start-offset, end-offset, speaker-label, transcript-text) are separated by tabs.
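The *.txt layout described above can be read with a few lines of Python. The following is only a minimal sketch, not an LDC-provided tool: the names Segment and parse_transcript are hypothetical, the offsets are assumed to be decimal numbers (the unit is not stated here), and the first line is treated as the comment line whatever its exact syntax is.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Segment(NamedTuple):
    """One time-stamped segment line (hypothetical helper type)."""
    start: float
    end: float
    speaker: str
    text: str


def parse_transcript(lines: Iterable[str]) -> Tuple[str, List[Segment]]:
    """Return (comment_line, segments) for one *.txt transcript.

    Assumptions: the first line is the comment line holding the file_id,
    and start/end offsets parse as decimal numbers.
    """
    it = iter(lines)
    # The first line is kept verbatim; its comment syntax is not interpreted.
    header = next(it, "").rstrip("\r\n")
    segments: List[Segment] = []
    for raw in it:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # the format has no blank lines; tolerate them anyway
        # Split on the first three tabs only, so any further tabs
        # remain part of the transcript text.
        start, end, speaker, text = line.split("\t", 3)
        segments.append(Segment(float(start), float(end), speaker, text))
    return header, segments


def is_sorted_by_start(segments: List[Segment]) -> bool:
    """Check the documented ordering by start-offset."""
    return all(a.start <= b.start for a, b in zip(segments, segments[1:]))
```

To read an actual file, pass an open file object, for example `parse_transcript(open(path, encoding="utf-8"))`; the UTF-8 encoding is likewise an assumption and may need to be adjusted for the Arabic-script files.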

Samples

Please view the following samples.

  • Romanized Text
  • Arabic Script Text
  • XML

Updates

None at this time.

Content Copyright

Portions © 1995-1996, 2000-2001, 2012-2014 Trustees of the University of Pennsylvania