NATO Native and Non-Native (N4) Speech Corpus V1.1

README

1. Publication Title:

NATO Native and Non-Native (N4) Speech Corpus V1.1

POC: John Grieco, 315-330-7672 (John.Grieco@afrl.af.mil)

Tech/Admin POC: Pat Ryan, 315-334-6990 (pryan@caci.com)

2. Authors:

Laurent Benarousse, Edouard Geoffrois

DGA/CTA/GIP, 16 bis avenue Prieur de la Côte d’Or, F-94114 Arcueil cedex, France

Laurent.Benarousse@etca.fr, Edouard.Geoffrois@etca.fr

John Grieco

Air Force Research Laboratory/IFEC, 32 Brooks Rd., Rome NY 13441, USA

John.Grieco@afrl.af.mil

Robert Series

20/20 Speech Ltd., MHSP Geraldine Rd., Malvern Worcs. WR14 3SZ, United Kingdom

r.series@2020speech.com

Herman Steeneken

TNO Human Factors, P.O. Box 23, 3769 ZG Soesterberg, The Netherlands

steeneken@tm.tno.nl

Hans Stumpf

Bundessprachenamt, Horbeller Strasse 52, 50354 Huerth, Germany

hans.w.stumpf@t-online.de

Carl Swail

Flight Research Laboratory, Building U-61, Montreal Rd., Ottawa Ontario, Canada

Carl.Swail@nrc.ca

Dieter Thiel

ZU-StellenBwTafkl, Kulmbacher Str. 58-60, D-95032 Hof, Germany

Dieter.Thiel@bnhof.de

3. Data Type: Text, Speech

4. Data Sources: Collection Type: Microphone

Years of Data Collection: 2000-2002

5. Project and Purpose of the Task: The NATO Native and Non-Native (N4) Corpus has been developed by the NATO research group on Speech and Language Technology, in order to provide a military-oriented database for multilingual and non-native speech processing studies.

6. Applications: This corpus can be used for various studies, including studies of the influence of non-nativeness on speech, language and speaker recognition, and studies of accent recognition.

7. Languages: Speech data was recorded in the naval transmission training centers of four countries (Germany, The Netherlands, United Kingdom and Canada). The material consists mainly of NATO English procedure between ships. In addition, the same speakers read a text (“The North Wind and the Sun”) both in English and in their mother tongue. The number of speakers per country ranges from 11 to 51, for a total of 115. The duration of speech per country ranges from 1.6 h to 3.0 h, for a total of around 9.5 h.

8. Special License: N/A

9. Funding Agency and Grant Number: N/A

10. Copyright: N/A

11. Description of the Corpus Structure and Data Attributes:

Data Type (Text, Speech, Video, Etc.) And File Formats:

Speech File Format is: NIST SPHERE
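As an illustration of the NIST SPHERE format named above, the following is a minimal Python sketch that reads the ASCII header found at the start of a SPHERE file. It assumes the standard header layout: a "NIST_1A" line, a header-size line, then "name -type value" fields terminated by "end_head". The function name parse_sphere_header is hypothetical, and the sample header is synthetic; its field values are not taken from this corpus.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header at the start of a NIST SPHERE file.

    `data` should hold at least the header bytes (commonly the first 1024).
    """
    lines = data.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    fields = {"header_size": int(lines[1].strip())}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":          # end of the field list
            break
        if not line or line.startswith(";"):
            continue                    # skip blank lines and comments
        name, type_code, value = line.split(None, 2)
        if type_code == "-i":           # integer field
            fields[name] = int(value)
        elif type_code == "-r":         # real-valued field
            fields[name] = float(value)
        else:                           # "-sN": string of N characters
            fields[name] = value
    return fields


# Synthetic header for illustration only (NOT values from this corpus).
example = (
    b"NIST_1A\n   1024\n"
    b"sample_rate -i 16000\n"
    b"channel_count -i 1\n"
    b"sample_coding -s3 pcm\n"
    b"end_head\n"
).ljust(1024)
print(parse_sphere_header(example))
```

For a real file, the same function could be applied to the first bytes read from disk, for example open(path, "rb").read(1024), assuming the header fits in 1024 bytes.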

Text File Formats are: Transcriber (.trs)

Microsoft Word (.doc) & (.xml)

Encoding: ASCII
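Transcriber (.trs) files are XML documents, so the transcripts can be read with a standard XML parser. The sketch below, in Python, lists the speaker turns of one transcript. It assumes the usual Transcriber layout, in which each Turn element carries speaker, startTime and endTime attributes and the spoken text sits between Sync markers; the function name list_turns and the sample transcript are hypothetical and not taken from this corpus.

```python
import xml.etree.ElementTree as ET


def list_turns(trs_bytes: bytes):
    """Return (speaker, start, end, text) for every <Turn> in a .trs file."""
    root = ET.fromstring(trs_bytes)
    turns = []
    for turn in root.iter("Turn"):
        # Collect all text inside the turn and normalise whitespace.
        text = " ".join("".join(turn.itertext()).split())
        turns.append((
            turn.get("speaker"),
            float(turn.get("startTime")),
            float(turn.get("endTime")),
            text,
        ))
    return turns


# Synthetic transcript for illustration only (NOT from this corpus).
sample = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<Trans><Episode><Section type="report" startTime="0" endTime="5.5">
<Turn speaker="spk1" startTime="0" endTime="2.5"><Sync time="0"/>
hello this is
<Sync time="1.2"/>
a test
</Turn>
</Section></Episode></Trans>"""
for speaker, start, end, text in list_turns(sample):
    print(speaker, start, end, text)
```

For a file on disk, the bytes could be obtained with open(path, "rb").read() before calling the function.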

Number of Files, Size of the Data:

989 Files / 15 Folders

2.14 GB

Directory Contents:

NATO_Corpus_for_LDC Folder (Properties 2.14GB, 993 files, 16 Folders)

N4_Corpus_README.doc

Spelling_changes.txt

CA Folder (Properties 586MB, 32 Files, 3 Folders)

CA_Audio_Sphere Folder (Contains 15 Audio Sphere Files)

CA_Docs Folder (Contains: CA_Speaker_Data.doc and CA_Speaker_Data.xml)

CA_Trans Folder (Contains 15 Transcriber Files)

DE Folder (Properties 361MB, 894 Files, 3 Folders)

DE_Audio_Sphere Folder (Contains 445 Audio Sphere Files)

DE_Docs Folder (Contains: DE_Speaker_Data.doc; DE_Speaker_Data.xml; NORTHWIND.TXT and WORTLIST.TXT)

DE_Trans Folder (Contains 445 Transcriber Files)

NL Folder (Properties 556MB, 43 Files, 3 Folders)

NL_Audio_Sphere Folder (Contains 19 Audio Sphere Files)

NL_Docs Folder (Contains: NL_README.doc; NL_README.xml; NL_Speaker_Data.doc; NL_Speaker_Data.xml; NORTHWIND.WPD)

NL_Trans Folder (Contains 19 Transcriber Files)

UK Folder (Properties 691MB, 18 Files, 3 Folders)

UK_Audio_Sphere Folder (Contains 7 Audio Sphere Files)

UK_Docs Folder (Contains: UK_README.doc; UK_README.xml; UK_Speaker_Data.doc; UK_Speaker_Data.xml)

UK_Trans Folder (Contains 7 Transcriber Files)
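Each country folder above lists the same number of audio files and Transcriber files, which suggests one transcript per recording. The Python sketch below checks that pairing for one country folder. It rests on assumptions not stated in this README: that an audio file and its transcript share a base name, and that audio files use the ".sph" extension. The function name unpaired_recordings is hypothetical.

```python
from pathlib import Path


def unpaired_recordings(country_dir, audio_ext=".sph", trans_ext=".trs"):
    """Return (audio without transcript, transcript without audio).

    Assumes the <CODE>_Audio_Sphere and <CODE>_Trans layout listed above,
    and that paired files share a base name (an assumption).
    """
    country_dir = Path(country_dir)
    code = country_dir.name  # e.g. "CA", "DE", "NL", "UK"

    def stems(folder, ext):
        # Compare extensions case-insensitively.
        return {p.stem for p in folder.iterdir()
                if p.is_file() and p.suffix.lower() == ext}

    audio = stems(country_dir / f"{code}_Audio_Sphere", audio_ext)
    trans = stems(country_dir / f"{code}_Trans", trans_ext)
    return sorted(audio - trans), sorted(trans - audio)
```

A call such as unpaired_recordings("NATO_Corpus_for_LDC/CA") would return two empty lists if every recording has a transcript under these assumptions.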