NATO Native and Non-Native (N4) Speech Corpus V1.1
README
1. Publication Title:
POC: John Grieco, 315-330-7672 (John.Grieco@afrl.af.mil)
Tech/Admin POC: Pat Ryan, 315-334-6990 (pryan@caci.com)
2. Authors:
Laurent Benarousse, Edouard Geoffrois
DGA/CTA/GIP, 16 bis avenue Prieur de la Côte d’Or, F-94114 Arcueil cedex, France
Laurent.Benarousse@etca.fr, Edouard.Geoffrois@etca.fr
John Grieco
Air Force Research Laboratory/IFEC, 32 Brooks Rd., Rome NY 13441, USA
Robert Series
20/20 Speech Ltd., MHSP Geraldine Rd., Malvern Worcs. WR14 3SZ, United Kingdom
Herman Steeneken
TNO Human Factors, P.O. Box 23, 3769 ZG Soesterberg, The Netherlands
Hans Stumpf
Bundessprachenamt, Horbeller Strasse 52, 50354 Huerth, Germany
Carl Swail
Flight Research Laboratory, Building U-61, Montreal Rd., Ottawa Ontario, Canada
Dieter Thiel
ZU-StellenBwTafkl, Kulmbacherst. 58-60, D-95032 Hof, Germany
3. Data Type: Text, Speech
4. Data Sources: Collection Type: Microphone
Years of Data Collection: 2000-2002
5. Project and Purpose of the Task: The NATO Native and Non-Native (N4) Corpus has been developed by the NATO research group on Speech and Language Technology, in order to provide a military-oriented database for multilingual and non-native speech processing studies.
6. Applications: This corpus can be used for various studies, including the influence of non-nativeness on speech recognition, language recognition and speaker recognition, as well as accent recognition.
7. Languages: Speech data was recorded in the naval transmission training centers of four countries (Germany, the Netherlands, the United Kingdom and Canada). The material consists mainly of NATO English procedure between ships. In addition, the same speakers read a text (“The North Wind and the Sun”) both in English and in their mother tongue. The number of speakers per country ranges from 11 to 51, for a total of 115. The duration of speech per country ranges from 1.6 h to 3.0 h, for a total of around 9.5 h.
8. Special License: N/A
9. Funding Agency and Grant Number: N/A
10. Copyright: N/A
11. Description of the Corpus Structure and Data Attributes:
Data Type (Text, Speech, Video, Etc.) And File Formats:
Speech File Format is: NIST SPHERE
Text File Formats are: Transcriber (.trs)
Microsoft Word (.doc) & (.xml)
Encoding: ASCII
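A NIST SPHERE file starts with a plain-text header, so the recording parameters can be inspected without a dedicated audio library. The following is a minimal sketch in Python, assuming the standard SPHERE layout (a "NIST_1A" line, a header-size line, then "name -type value" fields terminated by "end_head"). The function name read_sphere_header is hypothetical, and the sketch reads the header only; it does not decode the audio samples or handle compressed sample data.

```python
def read_sphere_header(stream):
    """Parse the text header of a NIST SPHERE file opened in binary mode.

    Returns (header_size_in_bytes, dict_of_fields).
    """
    magic = stream.readline().decode("ascii", errors="replace").strip()
    if magic != "NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    # The second line gives the total header size in bytes (commonly 1024).
    header_size = int(stream.readline().decode("ascii").strip())

    stream.seek(0)
    raw = stream.read(header_size).decode("ascii", errors="replace")

    fields = {}
    for line in raw.splitlines()[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip anything that is not "name -type value"
        name, type_tag, value = parts
        if type_tag == "-i":
            fields[name] = int(value)
        elif type_tag == "-r":
            fields[name] = float(value)
        else:
            fields[name] = value  # "-sN": string of N bytes
    return header_size, fields
```

Opening a file with open(path, "rb") and passing it to the function returns the header size, which is also the byte offset at which the sample data begins.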
Number of Files, Size of the Data:
989 Files / 15 Folders
2.14 GB
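The file count, folder count and total size of a local copy can be compared against the figures above. The following is a minimal sketch in Python; the function name summarize_tree is hypothetical, and the folder count it returns excludes the root folder itself, which is a convention choice that may differ from the one used for the stated figures.

```python
import os

def summarize_tree(root):
    """Return (file_count, folder_count, total_bytes) for everything under root.

    The root folder itself is not included in folder_count.
    """
    files = folders = total_bytes = 0
    for dirpath, dirnames, filenames in os.walk(root):
        folders += len(dirnames)
        files += len(filenames)
        for name in filenames:
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return files, folders, total_bytes
```

Dividing total_bytes by 1024**3 gives the size in the binary gigabytes that file managers typically report.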
Directory Contents:
NATO_Corpus_for_LDC Folder (Properties 2.14GB, 993 files, 16 Folders)
    N4_Corpus_README.doc
    Spelling_changes.txt
    CA Folder (Properties 586MB, 32 Files, 3 Folders)
        CA_Audio_Sphere Folder (Contains 15 Audio Sphere Files)
        CA_Docs Folder (Contains: CA_Speaker_Data.doc and CA_Speaker_Data.xml)
        CA_Trans Folder (Contains 15 Transcriber Files)
    DE Folder (Properties 361MB, 894 Files, 3 Folders)
        DE_Audio_Sphere Folder (Contains 445 Audio Sphere Files)
        DE_Docs Folder (Contains: DE_Speaker_Data.doc; DE_Speaker_Data.xml; NORTHWIND.TXT and WORTLIST.TXT)
        DE_Trans Folder (Contains 445 Transcriber Files)
    NL Folder (Properties 556MB, 43 Files, 3 Folders)
        NL_Audio_Sphere Folder (Contains 19 Audio Sphere Files)
        NL_Docs Folder (Contains: NL_README.doc; NL_README.xml; NL_Speaker_Data.doc; NL_Speaker_Data.xml; NORTHWIND.WPD)
        NL_Trans Folder (Contains 19 Transcriber Files)
    UK Folder (Properties 691MB, 18 Files, 3 Folders)
        UK_Audio_Sphere Folder (Contains 7 Audio Sphere Files)
        UK_Docs Folder (Contains: UK_README.doc; UK_README.xml; UK_Speaker_Data.doc; UK_Speaker_Data.xml)
        UK_Trans Folder (Contains 7 Transcriber Files)
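The Transcriber (.trs) files in the *_Trans folders are XML documents, so speaker turns can be extracted with a standard XML parser. The following is a minimal sketch in Python, assuming the usual Transcriber layout in which each <Turn> element carries speaker, startTime and endTime attributes and the transcribed text follows <Sync/> markers inside the turn. The function name read_turns is hypothetical, and the attributes actually present in this corpus should be checked against the files themselves.

```python
import xml.etree.ElementTree as ET

def read_turns(source):
    """Return a list of (speaker, start_time, end_time, text) for each <Turn>.

    source is a file path or a file object opened in binary mode.
    """
    root = ET.parse(source).getroot()
    turns = []
    for turn in root.iter("Turn"):
        # Transcribed text sits in the turn's own text and in the tail
        # that follows each child element (<Sync/>, <Event/>, ...).
        pieces = [turn.text or ""] + [child.tail or "" for child in turn]
        text = " ".join(" ".join(pieces).split())  # collapse whitespace
        turns.append((
            turn.get("speaker"),
            float(turn.get("startTime", "nan")),
            float(turn.get("endTime", "nan")),
            text,
        ))
    return turns
```

The start and end times are in seconds, which makes it possible to align each turn with the matching recording in the corresponding *_Audio_Sphere folder.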