README FILE FOR THE CALLFRIEND FARSI TRANSCRIPT CORPUS
LDC Catalog-ID: LDC2014T01


1. Introduction and Background

This corpus contains transcripts created from 100 telephone
conversations among native speakers of Farsi.  These calls were
recorded by the Linguistic Data Consortium in 1995-6 as part of the
CallFriend (CF) collection, which was designed primarily to support
research in automatic language identification.  One hundred native
Farsi speakers living in the continental U.S. were recruited and
offered incentives to make a single phone call, lasting up to 30
minutes, to a family member or friend living anywhere else in the
U.S.

Audio data for 60 of the calls were released, without transcripts, in
the LDC's 1996 membership year (corpus catalog-ID LDC96S50), and the
full set of 100 calls is being released concurrently with this set of
transcripts (LDC2013S..).  All CF recordings involved domestic calls
routed through the LDC's call collection platform, and were stored as
2-channel ("4-wire"), 8-kHz mu-law samples taken directly from the
public telephone network via a T-1 circuit.

In 2000-1, the LDC employed a small group of Farsi speakers to
transcribe the 100 CF Farsi calls, to support research in automatic
speech recognition.  Transcribers were instructed to use a romanized
(Latin-based), phonemic orthography, which was developed specifically
for Farsi, for two reasons: (a) we wanted to ensure that all vowels
would be represented consistently, and (b) at the time, the available
software tools for keyboarding and displaying text in Arabic script
were considered insufficient and/or too difficult for use in this
project.  The project ended without creating a process to convert the
romanized text to the standard, Arabic-based orthography used natively
by Farsi speakers, and partly because of this, the transcript corpus
was not published for general access, but was released only to a small
number of researchers.

In 2012, the DARPA "RATS" program elected to use the CF Farsi corpus,
both speech and transcripts.  In order to support the RATS research
tasks, the LDC contacted a research group who had addressed the
problem of converting the text to Arabic-script orthography, and
acquired a word list that mapped original transcript word forms to
their Arabic-script correlates.  This list didn't cover all the word
forms in the original text corpus, so an annotation task was conducted
as part of the LDC's overall effort in RATS in order to produce fully
Arabicized text for all 100 transcripts, and these were made available
to researchers in the RATS program.

In preparing the corpus for general release, we have also reviewed the
portions in each conversation that were marked by the original
transcribers as code-switching into English (i.e. the occasional use
of English words or phrases in the Farsi conversations), to rectify
English word spellings.


2. Directory Structure

   docs/  -- contains these files:
 
     recruit_demog.tab -- demographic data for recruited callers
     transcript_spec.txt -- describes transcript structure and markup
     transcript_stats.tab -- summary of transcript contents
     xml_transcript.dtd -- DTD for xml version of transcripts

   data/  -- contains three sub-directories, each with 100 files:

     asc/ -- original romanized transcripts as flat-table text files
       	         (fa_####_asc.txt)

     ntv/ -- Arabic-script transcripts as flat-table text files
       	         (fa_####_ntv.txt)

     xml/ -- both romanized and Arabic forms in a simple XML format
       	         (fa_####.xml)

In all the data files, the four-digit portion of the file name is a
numeric call-ID, used across all forms of data (text and audio) from a
given conversation.


3. Data File Formats

3.1  data/{asc,ntv}/fa_*.txt

The format of the fa_*.txt files is similar to CallHome transcript
files, except that the four main fields on each line (start-offset,
end-offset, speaker-label, transcript-text) are separated by tabs
rather than spaces.  Each file begins with a single "comment" line
containing the file_id string - e.g.:

  # fa_4099

This is followed immediately by the list of time-stamped segments, in
order according to their start-offset values, with no blank lines.
Two details are worth noting:

 - While start-offset values are in ascending order, end-offset values
   might not be; a long segment on one channel may be followed by, and
   end later than, one or more short segments on the other channel.

 - In the "ntv" (Arabicized) version of the text, all the transcript
   tokens are arranged in logical order on each line, but because many
   lines have both Arabic (Farsi words) and ASCII (annotation tokens),
   text-display tools that try to support bi-directional text may not
   be able to put the transcript tokens in the expected display order,
   and/or may have trouble with the placement and direction of bracket
   characters in the in-line markup.  It's possible to add Unicode
   directionality control characters to the text in order to get a
   proper display (or convert the file content into a form that can be
   displayed correctly in a browser), but nothing of that sort has
   been done in this release of the data.
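
As a rough illustration of the layout described above, the following
Python sketch reads one flat-table transcript.  The function name, the
Segment tuple and the sample lines (including the romanized words) are
invented for this example; only the leading comment line and the four
tab-separated fields come from this README.

```python
import io
from typing import List, NamedTuple, TextIO, Tuple


class Segment(NamedTuple):
    start: float   # start-offset in seconds
    end: float     # end-offset in seconds
    speaker: str   # speaker label as written in the file, e.g. "A:"
    text: str      # transcript text, including any in-line markup


def read_transcript(handle: TextIO) -> Tuple[str, List[Segment]]:
    """Return (file_id, segments) from an open fa_*.txt transcript."""
    header = handle.readline().rstrip("\n")
    if not header.startswith("#"):
        raise ValueError("expected a '# file_id' comment line first")
    file_id = header.lstrip("#").strip()

    segments = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # the format has no blank lines; skip defensively
        fields = line.split("\t", 3)
        if len(fields) != 4:
            raise ValueError("expected 4 tab-separated fields: %r" % line)
        start, end, speaker, text = fields
        segments.append(Segment(float(start), float(end), speaker, text))
    return file_id, segments


# Made-up sample in the documented layout (not taken from the corpus).
sample = (
    "# fa_4099\n"
    "0.50\t4.75\tA:\tsalAm {laugh}\n"
    "1.20\t2.10\tB:\tbale\n"
    "5.00\t6.30\tA:\txob\n"
)
file_id, segs = read_transcript(io.StringIO(sample))
```

In the sample, the second segment (channel B) ends before the first
one (channel A) does, so code that walks the list should not assume
that end-offset values are sorted.  For a real file, something like
open(path, encoding="utf-8") would replace the StringIO object; the
UTF-8 encoding is an assumption, not something stated in this README.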


3.2  data/xml/fa_*.xml

The XML form of the transcripts contains both Arabicized and romanized
forms for Farsi words.  The basic XML structure is as follows
(attribute values in parentheses refer to notes below; line breaks
have been inserted among the "token" attributes for legibility):

 <conversation id="fa_####">
  <segment docid="fa_####" speaker="(a)" ch="(a)" end="(b)" start="(b)">
   <token
      raw="(c)"
      type="(d)"
      clean="(e)"
      solution="(f)" soltype="(f)"
      lang="(g)"/>
   <token ...>
    <annotator_comment>(h)</annotator_comment>
   </token>
  </segment>
  ...
 </conversation>

Notes:

(a) There is a distinct label for each speaker present in the
transcript; this attribute value matches column 3 of the plain-text
format, minus the ':' character (3.1).  The "ch" attribute is "0" for
channel A (first or 'left' channel), and "1" for channel B.

(b) start and end are in seconds, and match the first two columns of
the plain-text format.

(c) raw is the original token as presented in the "*_asc.txt" version;
this may be a markup tag ("<as>", "</as>", "{cough}", etc), a Farsi
word (possibly with a "token-type" marker - see transcript_spec.txt),
or a code-switched word or phrase in English ("<English bye bye>").

(d) type is one of the following:
  "normal" (Farsi word token)
  "propernoun" (Farsi name, marked with "&" in the raw value)
  "interjection" (marked with "%" in the raw value)
  "markup" (for "<as>", etc)
  "foreign" (e.g. "&lt;English bye bye &gt;" - note character entities)
  "speaker_noise" ("{cough}", etc)
  "unintelligible" (always rendered as "(( ))")

(e) clean is a version of "raw" minus certain token markers and
bracketing (this attribute is not present if type="unintelligible")

(f) solution and soltype are only present for Farsi words, and provide
the native Farsi orthography and the provenance of that form (either
"lexicon" or "manual")

(g) lang only occurs when type="foreign"; its value happens to always
be "English".

(h) Annotator_comment elements appear in only 181 tokens; a few
consist of English comments, but most are in native Farsi orthography.
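
The structure and notes above can be sketched in Python with the
standard xml.etree.ElementTree module.  The sample document, the
function name segment_words and the set NON_WORD_TYPES are invented
for illustration (in particular, the "clean" value shown for the
noise token is a guess); only the element and attribute names are
taken from this README.

```python
import xml.etree.ElementTree as ET

# Invented sample; the Arabic-script word is written with \u escapes so
# that this listing itself stays plain ASCII.
SAMPLE_XML = """<conversation id="fa_0000">
 <segment docid="fa_0000" speaker="A" ch="0" end="4.75" start="0.50">
  <token raw="salAm" type="normal" clean="salAm"
         solution="\u0633\u0644\u0627\u0645" soltype="lexicon"/>
  <token raw="{laugh}" type="speaker_noise" clean="laugh"/>
  <token raw="&lt;English bye bye&gt;" type="foreign" clean="bye bye"
         lang="English"/>
  <token raw="(( ))" type="unintelligible"/>
 </segment>
</conversation>"""

# Token types treated here as annotation rather than spoken words.
NON_WORD_TYPES = {"markup", "speaker_noise", "unintelligible"}


def segment_words(segment, native=True):
    """List the word forms of one <segment>, skipping annotation tokens.

    With native=True the "solution" (Arabic-script) form is preferred;
    tokens without one, such as code-switched English, fall back to
    their "clean" form.
    """
    words = []
    for token in segment.findall("token"):
        if token.get("type") in NON_WORD_TYPES:
            continue
        form = token.get("solution") if native else None
        if form is None:
            form = token.get("clean")
        if form:
            words.append(form)
    return words


root = ET.fromstring(SAMPLE_XML)
segment = root.find("segment")
native_words = segment_words(segment, native=True)
roman_words = segment_words(segment, native=False)
channel = "A" if segment.get("ch") == "0" else "B"
```

Working with a list of word forms, rather than a pre-joined string,
leaves the question of display order for mixed Arabic and ASCII text
to whatever tool eventually renders it.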


3.3  docs/recruit_demog.tab

In this flat-table, tab-delimited, plain-text file, the first line
contains column headings, as listed below:

     1	file_id  (fa_####)
     2	gender   M or F
     3	educ     number of years of formal education
     4	age      age in years at time of collection
     5	raised   city or location where raised

This represents all the demographic information available for
speakers in the corpus; it exists only for the people who were
recruited to make calls - no information is available for the
callees.  The recruited caller appears on side "A" (first or 'left'
channel of each recorded call), but some calls have multiple speakers
on that side (as detailed in the "transcript_stats.tab" file,
described below).
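
A minimal Python sketch for loading this table is shown below.  It
assumes that the header line uses exactly the column names listed
above; the function name and the sample row are invented for the
example.

```python
import csv
import io


def read_demographics(handle):
    """Map each file_id to its row of demographic fields (as strings)."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row["file_id"]: row for row in reader}


# Made-up sample row in the documented column order.
sample = (
    "file_id\tgender\teduc\tage\traised\n"
    "fa_0000\tF\t16\t34\tTehran\n"
)
demog = read_demographics(io.StringIO(sample))

# Look callers up with .get() so that a call without a usable entry
# yields None instead of raising KeyError.
caller = demog.get("fa_0000")
missing = demog.get("fa_9999")
```

All values are kept as strings here; converting "educ" and "age" to
numbers is left to the caller, since this README does not say whether
every row has those fields filled in.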


3.4  docs/transcript_stats.tab

In this flat-table, tab-delimited, plain-text file, the first line
contains the column headings, as listed below (there are footnotes
below the list for items marked by "*"):

     1  file_id  (fa_####)
     2  bgn      offset in seconds to start of first transcribed segment
     3  end      offset in seconds to end of last transcribed segment
     4  span     total duration in seconds covered by the transcript
     5  sphsec   total sum in seconds of segment durations *(a)
     6  segs     number of segments in the transcript
     7  ascTkns  number of space-separated tokens in romanized text *(b)
     8  ntvTkns  number of space-separated tokens in Arabicized text *(b)
     9  gender   gender of speakers in the transcripts *(c)
    10  A_nspk   number of speakers transcribed on channel A
    11  A_segs   number of segments for each A speaker *(d)
    12  B_nspk   number of speakers transcribed on channel B
    13  B_segs   number of segments for each B speaker *(d)

Notes:

*(a) In 63 of the 100 files, the sum of segment durations adds up to
more than the total span of audio covered by the transcript; this
indicates a lot of overlap between the segments of the two channels in
the call, which is due partly to speakers talking over each other, and
partly to the tendency of the transcribers to include variable margins
of non-speech regions when setting the boundaries of the segments to
be transcribed.

*(b) The ASCII and Arabicized token counts include not only the spoken
words, but also annotation tokens (i.e. in-line markup, indicators of
audible non-speech events, etc, described in 'transcript_spec.txt').
The Arabicized ("native") token count is always higher than the ASCII
count because there are a variety of cases where an item rendered as a
single romanized word form by transcribers was converted to a string
of two or more words in native Farsi orthography.

*(c) The information available about the speakers is limited to what
can be determined through listening to the recordings.  The amount of
detail derived from auditing was very limited, and the "gender" column
only records whether or not speakers are of mixed gender, and if not,
which gender is common to all speakers in the call.

*(d) Columns 11 and 13 are variable length, with internal structure.
If there is only one speaker on a given channel, the number of
segments for that speaker is shown as "A:"=># or "B:"=># (where "#" is
a number, usually greater than 100).  If there are multiple speakers
present on a given channel, different labels are given for each one,
along with number of segments attributed to each - e.g.:

 file_id  ... segs ...  A_segs              ...  B_segs
 ...
 fa_7014  ... 456  ...  "A:"=>210,"A2:"=>1  ...  "B:"=>148,"B2:"=>97

These fields contain no spaces (only commas between speakers,
double-quotes around speaker labels, and '=>' as a separator
between the speaker label and the segment count).
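
One way to unpack the A_segs and B_segs fields in Python is sketched
below, using the fa_7014 values from the example above.  The function
name and the regular expression are illustrative choices, not part of
the corpus tools.

```python
import re

# One "label"=>count pair, e.g. "A2:"=>1
_PAIR = re.compile(r'"([^"]+)"=>(\d+)')


def parse_speaker_segs(field):
    """Turn '"A:"=>210,"A2:"=>1' into {'A:': 210, 'A2:': 1}."""
    return {label: int(count) for label, count in _PAIR.findall(field)}


a_segs = parse_speaker_segs('"A:"=>210,"A2:"=>1')
b_segs = parse_speaker_segs('"B:"=>148,"B2:"=>97')

# For fa_7014 the per-speaker counts add up to the "segs" column (456).
total = sum(a_segs.values()) + sum(b_segs.values())
```

If the table is read with Python's csv module, passing
quoting=csv.QUOTE_NONE keeps the double-quotes in these fields intact;
the default quote handling would strip or rearrange them.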


4. Known Issues

- We do not have demographic data on the recruited caller for file_id
  fa_5758.

- The transcription of file_id fa_7003 is incomplete, containing
  utterances for speaker "A:" only (none of channel B was transcribed
  in this call).


------------------
README file created by David Graff, July 29, 2013.