README FILE FOR THE CALLFRIEND FARSI TRANSCRIPT CORPUS

LDC Catalog-ID: LDC2014T01

1. Introduction and Background

This corpus contains transcripts created from 100 telephone conversations among native speakers of Farsi. These calls were recorded by the Linguistic Data Consortium in 1995-6 as part of the CallFriend (CF) collection, which was designed primarily to support research in automatic language identification. One hundred native Farsi speakers living in the continental U.S. were recruited and offered incentives to make a single phone call, lasting up to 30 minutes, to a family member or friend living anywhere else in the U.S.

Audio data for 60 of the calls were released, without transcripts, in the LDC's 1996 membership year (corpus catalog-ID LDC96S50), and the full set of 100 calls is being released concurrently with this set of transcripts (LDC2013S..).

All CF recordings involved domestic calls routed through the LDC's call collection platform, and were stored as 2-channel ("4-wire"), 8-kHz mu-law samples taken directly from the public telephone network via a T-1 circuit.

In 2000-1, the LDC employed a small group of Farsi speakers to transcribe the 100 CF Farsi calls, to support research in automatic speech recognition. Transcribers were instructed to use a romanized (Latin-based), phonemic orthography, which was developed specifically for Farsi, for two reasons: (a) we wanted to ensure that all vowels would be represented consistently, and (b) at the time, the available software tools for keyboarding and displaying text in Arabic script were considered insufficient and/or too difficult for use in this project.

The project ended without creating a process to convert the romanized text to the standard, Arabic-based orthography used natively by Farsi speakers. Partly because of this, the transcript corpus was not published for general access, but was released only to a small number of researchers.

In 2012, the DARPA "RATS" program elected to use the CF Farsi corpus, both speech and transcripts. In order to support the RATS research tasks, the LDC contacted a research group that had addressed the problem of converting the text to Arabic-script orthography, and acquired a word list that mapped original transcript word forms to their Arabic-script correlates. This list didn't cover all the word forms in the original text corpus, so an annotation task was conducted as part of the LDC's overall effort in RATS in order to produce fully Arabicized text for all 100 transcripts, and these were made available to researchers in the RATS program.

In preparing the corpus for general release, we have also reviewed the portions in each conversation that were marked by the original transcribers as code-switching into English (i.e. the occasional use of English words or phrases in the Farsi conversations), to rectify English word spellings.

2. Directory Structure

docs/ -- contains these files:

    recruit_demog.tab    -- demographic data for recruited callers
    transcript_spec.txt  -- describes transcript structure and markup
    transcript_stats.tab -- summary of transcript contents
    xml_transcript.dtd   -- DTD for xml version of transcripts

data/ -- contains three sub-directories, each with 100 files:

    asc/ -- original romanized transcripts as flat-table text files
            (fa_####_asc.txt)
    ntv/ -- Arabic-script transcripts as flat-table text files
            (fa_####_ntv.txt)
    xml/ -- both romanized and Arabic forms in a simple XML format
            (fa_####.xml)

In all the data files, the four-digit portion of the file name is a numeric call-ID, used across all forms of data (text and audio) from a given conversation.

3. Data File Formats

3.1 data/{asc,ntv}_txt/fa_*.txt

The format of the fa_*.txt files is similar to CallHome transcript files, except that the four main fields on each line (start-offset, end-offset, speaker-label, transcript-text) are separated by tabs rather than spaces.
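
The four-field layout can be read with a few lines of Python. The following is only a minimal sketch based on what this README states: tab-separated fields, offsets in seconds, and a leading "#" comment line (described just below). The names Segment and parse_transcript are illustrative and not part of the corpus, and the sample lines are invented placeholders, not real transcript content.

```python
from typing import Iterable, List, NamedTuple


class Segment(NamedTuple):
    start: float   # start-offset, in seconds
    end: float     # end-offset, in seconds
    speaker: str   # speaker label as written in the file, e.g. "A:"
    text: str      # transcript text (romanized or Arabic-script)


def parse_transcript(lines: Iterable[str]) -> List[Segment]:
    """Parse the lines of one fa_*.txt file into Segment records."""
    segments = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip the file_id comment line (and any empty line)
        # Split on the first three tabs only, so the text field stays whole.
        start, end, speaker, text = line.split("\t", 3)
        segments.append(Segment(float(start), float(end), speaker, text))
    return segments


# Invented placeholder data in the documented layout (not corpus content).
sample = [
    "# fa_0000\n",
    "1.25\t3.50\tA:\tplaceholder text one\n",
    "2.75\t3.10\tB:\tplaceholder text two\n",
]
segments = parse_transcript(sample)
```

For a real file, the same function can be applied to an open file handle; the text encoding to pass to open() is not stated in this README and would need to be checked against the data.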

Each file begins with a single "comment" line containing the file_id string - e.g.:

    # fa_4099

This is followed immediately by the list of time-stamped segments, in order according to their start-offset values, with no blank lines. Two details are worth noting:

 - While start-offset values are in ascending order, end-offset values might not be; a long segment on one channel may be followed by, and end later than, one or more short segments on the other channel.

 - In the "ntv" (Arabicized) version of the text, all the transcript tokens are arranged in logical order on each line, but because many lines have both Arabic (Farsi words) and ASCII (annotation tokens), text-display tools that try to support bi-directional text may not be able to put the transcript tokens in the expected display order, and/or may have trouble with the placement and direction of bracket characters in the in-line markup. It's possible to add Unicode directionality control characters to the text in order to get a proper display (or convert the file content into a form that can be displayed correctly in a browser), but nothing of that sort has been done in this release of the data.

3.2 data/xml/fa_*.xml

The XML form of the transcripts contains both Arabicized and romanized forms for Farsi words. The basic XML structure is as follows (attribute values in parentheses refer to notes below; line breaks have been inserted among the "token" attributes for legibility):

    (h) ...

Notes:

 (a) There is a distinct label for each speaker present in the transcript; this attribute value matches column 3 of the plain-text format, minus the ':' character (3.1). The "ch" attribute is "0" for channel A (first or 'left' channel), and "1" for channel B.

 (b) start and end are in seconds, and match the first two columns of the plain-text format.

 (c) raw is the original token as presented in the "*_asc.txt" version; this may be a markup tag ("", "", "{cough}", etc), a Farsi word (possibly with a "token-type" marker - see transcript_spec.txt), or a code-switched word or phrase in English ("").

 (d) type is one of the following:

       "normal"          (Farsi word token)
       "propernoun"      (Farsi name, marked with "&" in the raw value)
       "interjection"    (marked with "%" in the raw value)
       "markup"          (for "", etc)
       "foreign"         (e.g. "<English bye bye >" - note character entities)
       "speaker_noise"   ("{cough}", etc)
       "unintelligible"  (always rendered as "(( ))")

 (e) clean is a version of "raw" minus certain token markers and bracketing (this attribute is not present if type="unintelligible").

 (f) solution and soltype are only present for Farsi words, and provide the native Farsi orthography and the provenance of that form (either "lexicon" or "manual").

 (g) lang only occurs when type="foreign"; its value happens to always be "English".

 (h) Annotator_comment elements appear in only 181 tokens; a few consist of English comments, but most are in native Farsi orthography.

3.3 docs/recruit_demog.tab

In this flat-table, tab-delimited, plain-text file, the first line contains column headings, as listed below:

    1  file_id  (fa_####)
    2  gender   M or F
    3  educ     number of years of formal education
    4  age      age in years at time of collection
    5  raised   city or location where raised

This represents all the demographic information available for speakers in the corpus; it exists only for the people who were recruited to make calls - no information is available for the callees. The recruited caller appears on side "A" (first or 'left' channel of each recorded call), but some calls have multiple speakers on that side (as detailed in the "transcript_stats.tab" file, described below).
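
As an illustration of the table layout just described, here is a minimal Python sketch that loads such a file into a dictionary keyed by file_id. It assumes that the heading line uses exactly the five names listed above; the function name read_demographics is hypothetical, and the sample rows are invented placeholders rather than real speaker data.

```python
import csv
import io


def read_demographics(handle):
    """Return a dict mapping each file_id to its row of demographic fields."""
    # QUOTE_NONE treats any quote characters in the data as ordinary text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row["file_id"]: row for row in reader}


# Invented placeholder rows using the five documented column headings.
sample = io.StringIO(
    "file_id\tgender\teduc\tage\traised\n"
    "fa_0001\tF\t16\t30\tPlaceholder City\n"
    "fa_0002\tM\t12\t45\tAnother Place\n"
)
demog = read_demographics(sample)
```

All values are returned as strings. Because demographic data is not available for every call (see Known Issues), a lookup such as demog.get(file_id) is safer than demog[file_id].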

3.4 docs/transcript_stats.tab

In this flat-table, tab-delimited, plain-text file, the first line contains the column headings, as listed below (there are footnotes below the list for items marked by "*"):

     1  file_id  (fa_####)
     2  bgn      offset in seconds to start of first transcribed segment
     3  end      offset in seconds to end of last transcribed segment
     4  span     total duration in seconds covered by the transcript
     5  sphsec   total sum in seconds of segment durations *(a)
     6  segs     number of segments in the transcript
     7  ascTkns  number of space-separated tokens in romanized text *(b)
     8  ntvTkns  number of space-separated tokens in Arabicized text *(b)
     9  gender   gender of speakers in the transcripts *(c)
    10  A_nspk   number of speakers transcribed on channel A
    11  A_segs   number of segments for each A speaker *(d)
    12  B_nspk   number of speakers transcribed on channel B
    13  B_segs   number of segments for each B speaker *(d)

Notes:

*(a) In 63 of the 100 files, the sum of segment durations adds up to more than the total span of audio covered by the transcript; this indicates a lot of overlap between the segments of the two channels in the call, which is due partly to speakers talking over each other, and partly to the tendency of the transcribers to include variable margins of non-speech regions when setting the boundaries of the segments to be transcribed.

*(b) The ASCII and Arabicized token counts include not only the spoken words, but also annotation tokens (i.e. in-line markup, indicators of audible non-speech events, etc, described in 'transcript_spec.txt'). The "native" token count is always higher than the ASCII count because there are a variety of cases where an item rendered as a single romanized word form by transcribers was converted to a string of two or more words in native Farsi orthography.

*(c) The information available about the speakers is limited to what can be determined through listening to the recordings.
The amount of detail derived from auditing was very limited, and the "gender" column only records whether or not speakers are of mixed gender, and if not, which gender is common to all speakers in the call.

*(d) Columns 11 and 13 are variable length, with internal structure. If there is only one speaker on a given channel, the number of segments for that speaker is shown as "A:"=># or "B:"=># (where "#" is a number, usually greater than 100). If there are multiple speakers present on a given channel, different labels are given for each one, along with the number of segments attributed to each - e.g.:

    file_id ... segs ... A_segs             ... B_segs ...
    fa_7014 ... 456  ... "A:"=>210,"A2:"=>1 ... "B:"=>148,"B2:"=>97

These fields contain no spaces (only commas between speakers, double-quotes around speaker labels, and '=>' as a separator between the speaker label and the segment count).

4. Known Issues

 - We do not have demographic data on the recruited caller for file_id fa_5758.

 - The transcription of file_id fa_7003 is incomplete, containing utterances for speaker "A:" only (none of channel B was transcribed in this call).

------------------
README file created by David Graff, July 29, 2013.