This README.TXT file describes all files for the 200 adult speakers who were recorded for the Arko Urdu project. The recordings were carried out on behalf of the US Army Research Laboratory in 2006.

In each session, one speaker was presented with 400 prompts to read: sentences, place names, and person names. Two microphones, set at different distances from the speaker, were used for the recordings.


For the two speech files of each item there is one SAM label file with information about the speech files. The label file has the extension URO (letter O); the speech files for the two microphones have the extensions UR0 and UR1 (digits zero and one).
The files of one speaker are located in directories named SESxxx, where xxx represents the number of the recorded session.
Ten sessions were put into one block; these blocks are directories named BLOCKxx, where xx starts with zero (00). The blocks are located in the root directory \ADULT1UR.

The speech files are separated from the SAM label files.
The volume names are ADULT1UR000-ADULT1UR010 for the speech files and ADULT1URD00 for the documentation files. Both volumes have the same directory structure.

Description of the files of the database:

Extension DOC denotes Microsoft Word for Windows files.
Extension PS denotes PostScript files.
Extension TXT denotes plain text files.
Extension PDF denotes Adobe Portable Document Format files.


Contents of Root directory

README.TXT	database description file as plain text, this file
DISK.ID		volume name
COPYRIGH.TXT	copyright statement
ADULT1UR000	directory
ADULT1URD00	directory


Contents of directory ADULT1URD00

DOC		directory
TABLE		directory
INDEX		directory
BLOCK<NN>	directory


Contents of directory ADULT1UR\DOC

DESIGN.DOC	contains free text information about database (created in Tahoma font)
SAMPALEX.PS	table of SAMPA symbols used for the phoneme notation in TABLE\LEXICON.TBL 
SUMMAR0.TXT	contains a description of all recording sessions (channel 0) using the following mnemonics:
			DIR		full directory path of the session
			SES		Session number
			CCD2N		2 strings with N corpus codes, where N is the total number of items. The 2 strings are separated by a space.
			RED		Recording date of first item
			RET		Recording time of first item	

Contents of directory ADULT1UR\TABLE	

LEXICON.TBL	lexicon file: alphabetically ordered table of the distinct lexical items that occur in the corpus, with the corresponding pronunciation information, the ranking of alternative pronunciations, and the frequencies of occurrence
REC_COND.TBL	table with information about the conditions of all recording sessions, using the following mnemonics:
			SES	session number
			MIP	microphone positions
			MIT	microphone types
			SCC	scenario code
SESSION.TBL	table with information about each recording session (speaker, recording environment, recording time and date), using the following mnemonics:
			SES	session number
			SCD	unique speaker code
			REP	recording place
			RED	recording date
			RET 	recording time
SPEAKER.TBL	table with information about each speaker (code, gender, age, accent), using the following mnemonics:
			SCD	unique speaker code
			SEX	speaker gender
			AGE	speaker age
			ACC	speaker accent


Contents of directory ADULT1UR\INDEX
	
CONTENT0.LST	transcription of each recorded utterance (encoding = UTF-8) and information about the speaker and environment, using the following mnemonics:
			DIR	directory
			SRC	speech signal file name
			CCD	corpus code
			SCD	speaker code
			SEX	speaker gender
			AGE	speaker age
			ACC	speaker accent
			SCC 	scenario code
			LBO	speech transcription without the numerical data


Contents of directory ADULT1UR\BLOCK<NN>
(<NN> is a number from 00 to 20)

SES<NN><M>		directories for each recording session, 
			<NN> is the block number, 
			<M> is a session number from 0 to 9

Contents of directory ADULT1UR\BLOCK<NN>\SES<NN><M>
	
SA<NN><M><COR>.URO	SAM label file of item with corpus code <COR>
SA<NN><M><COR>.UR0	signal file of item with corpus code <COR>, channel 0
SA<NN><M><COR>.UR1	signal file of item with corpus code <COR>, channel 1
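
The naming scheme above can be sketched in Python as follows. This is only an illustration, not part of the database tools. The function names session_dir and item_files are hypothetical, the corpus code "001" in the usage note is a made-up example value, and the sketch assumes that the block number <NN> equals the session number divided by 10 (ten sessions per block).

```python
from pathlib import PureWindowsPath

# Root directory as named in this README (written here without the leading backslash).
ROOT = PureWindowsPath("ADULT1UR")


def session_dir(session: int) -> PureWindowsPath:
    """Return ROOT\\BLOCK<NN>\\SES<NN><M> for a session number such as 57."""
    if not 0 <= session <= 999:
        raise ValueError("session number must fit in three digits")
    # Assumption: ten sessions per block, so <NN> = session // 10 and <M> = session % 10.
    block, m = divmod(session, 10)
    return ROOT / f"BLOCK{block:02d}" / f"SES{block:02d}{m}"


def item_files(session: int, cor: str) -> dict:
    """Return the label file and the two signal files for one item.

    cor is the corpus code <COR>; its exact format is not assumed here.
    """
    directory = session_dir(session)
    base = f"SA{session:03d}{cor}"
    return {
        "label": directory / f"{base}.URO",     # SAM label file (letter O)
        "channel0": directory / f"{base}.UR0",  # signal file, channel 0
        "channel1": directory / f"{base}.UR1",  # signal file, channel 1
    }
```

For example, item_files(57, "001") would point at files named SA057001.URO, SA057001.UR0 and SA057001.UR1 inside ADULT1UR\BLOCK05\SES057, provided the assumptions above hold for this database.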