------------------------------------------------------------------------------
                          WIKIPEDIA SPANISH CORPUS
                      Audio and Transcripts in Spanish
                    presented in a CIEMPIESS Corpus style
                 taken from the WikiProject Wikipedia Grabada
------------------------------------------------------------------------------

------------------------------------------------------------------------------
PRESENTATION
------------------------------------------------------------------------------

The WikiProject Spoken Wikipedia aims to produce recordings of Wikipedia
articles being read aloud, as described on its project page, available at:
https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Spoken_Wikipedia

The WIKIPEDIA SPANISH CORPUS is a dataset created from the Spanish version
of that project, called Wikipedia Grabada, available at:
https://es.wikipedia.org/wiki/Wikiproyecto:Wikipedia_grabada

The WIKIPEDIA SPANISH CORPUS is intended for the Automatic Speech
Recognition (ASR) task. Its organization is similar to that of the CIEMPIESS
LIGHT CORPUS, published in 2017 by the Linguistic Data Consortium
(LDC2017S23), which is why it is described as being presented in the style
of the CIEMPIESS Corpus.

The WIKIPEDIA SPANISH CORPUS is a gender-unbalanced corpus with a duration
of more than 25 hours. It contains read speech from several Wikipedia
Grabada articles, most of which were recorded by male speakers. The
transcriptions in this corpus were created from scratch by native speakers.

------------------------------------------------------------------------------
CORPUS CHARACTERISTICS
------------------------------------------------------------------------------

The WIKIPEDIA SPANISH CORPUS (WSC) has the following characteristics:

- The WSC has a total duration of 25 hours and 37 minutes. It contains 11569
  audio files.

- The WSC has 193 different speakers: 150 men and 43 women.

- Every audio file in the WSC has a duration of approximately 3 to 10
  seconds.

- Data in the WSC is organized by speaker; that is, all the recordings of a
  single speaker are stored in a single directory.

- Data is also classified according to the gender (male/female) of the 
  speakers.

- The audio in the WSC was segmented and transcribed from scratch by native
  speakers of Spanish.

- Audio files in the WSC are distributed in a 16 kHz, 16-bit, mono format.

- Every audio file has an ID that is compatible with ASR engines such as 
  Kaldi and CMU-Sphinx.
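
As a quick sanity check on the figures above, the following Python sketch
derives the mean file duration and the share of male speakers. It uses only
the numbers stated in this list; the variable names are illustrative and are
not part of the corpus.

```python
# Figures taken from the characteristics listed above.
total_seconds = 25 * 3600 + 37 * 60  # 25 hours and 37 minutes, in seconds
audio_files = 11569
speakers = {"male": 150, "female": 43}

# Mean duration of one audio file, in seconds.
mean_duration = total_seconds / audio_files

# Share of male speakers (by speaker count, not by amount of recorded speech).
male_share = speakers["male"] / sum(speakers.values())

print(f"mean file duration: {mean_duration:.2f} s")  # about 7.97 s
print(f"male speakers: {male_share:.1%}")            # about 77.7%
```

The mean of roughly 8 seconds per file is consistent with the stated range
of 3 to 10 seconds.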

------------------------------------------------------------------------------
GENERAL ORGANIZATION OF THE DIRECTORIES
------------------------------------------------------------------------------

The WIKIPEDIA_SPANISH directory contains the following files and directories:

    - data/files      : Contains the transcription file, the paths file, and
                        the "Speaker_Info.xls" file, which holds relevant
                        information about all the speakers in the corpus.

    - data/speech     : Contains the speech files, classified by gender
                        (male/female).

    - docs/README.txt : This file.

------------------------------------------------------------------------------
THE CORPUS FILES
------------------------------------------------------------------------------

In the "files" directory one can find the following:

- WIKIPEDIA_SPANISH.transcription : This is the transcription file in plain 
                                    text format.

- WIKIPEDIA_SPANISH.paths         : This file contains the relative paths 
                                    from the "speech" directory to every 
                                    particular speech file.

- Speaker_info.xls                : This file contains relevant information
                                    about the speakers; specifically, the
                                    number of audio files per speaker and the
                                    total speech time per speaker.

- Source_Files.list               : This file contains a list of the original
                                    audio files used to create the corpus.
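
The paths file can be turned into a lookup table from identification key to
audio file. The Python sketch below is only an illustration under two
assumptions that this README does not confirm: that the paths file holds one
relative path per line, and that each file name without its extension is the
identification key. The function name load_paths is hypothetical.

```python
from pathlib import Path, PurePosixPath


def load_paths(paths_file, speech_dir):
    """Map each identification key to its audio path under speech_dir.

    Assumptions (not confirmed by this README): one relative path per line,
    and the file name without its extension is the identification key.
    """
    index = {}
    with open(paths_file, encoding="utf-8") as handle:
        for line in handle:
            relative = line.strip()
            if not relative:
                continue  # skip blank lines
            key = PurePosixPath(relative).stem
            index[key] = Path(speech_dir) / relative
    return index
```

A call such as load_paths("data/files/WIKIPEDIA_SPANISH.paths",
"data/speech") would then return a dictionary keyed by identification key,
provided the assumptions above hold for the distributed file.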

------------------------------------------------------------------------------
IDENTIFICATION KEY FORMAT
------------------------------------------------------------------------------

Every audio file in the WIKIPEDIA SPANISH CORPUS has an identification key 
with the following format:

                          WKSP_M_0010_E1_0015


     WKSP            M            0010       E1           0015
   Acronym      Gender of        Number    Edition   Number of the
     for        the Speaker:       of        One     audio file of
  "WIKIPEDIA    "M" for Male     Speaker              a particular
    SPANISH"    "F" for Female                          speaker
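
The key layout above can be split into its fields with a short Python
sketch. This is not a tool shipped with the corpus: the names UtteranceKey
and parse_key are hypothetical, and the pattern accepts any number of digits
in the numeric fields because this README shows only a single example.

```python
import re
from typing import NamedTuple


class UtteranceKey(NamedTuple):
    corpus: str     # "WKSP", the acronym for "WIKIPEDIA SPANISH"
    gender: str     # "M" for male, "F" for female
    speaker: int    # number of the speaker
    edition: int    # edition number (the "1" in "E1")
    utterance: int  # number of the audio file for that speaker


# Assumption: the digit widths are not fixed, since only one example is given.
KEY_PATTERN = re.compile(r"(WKSP)_([MF])_(\d+)_E(\d+)_(\d+)")


def parse_key(key):
    """Split an identification key such as WKSP_M_0010_E1_0015."""
    match = KEY_PATTERN.fullmatch(key)
    if match is None:
        raise ValueError(f"not a valid identification key: {key!r}")
    corpus, gender, speaker, edition, utterance = match.groups()
    return UtteranceKey(corpus, gender, int(speaker), int(edition), int(utterance))
```

For the example key, parse_key("WKSP_M_0010_E1_0015") yields gender "M",
speaker 10, edition 1 and utterance 15.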

------------------------------------------------------------------------------
AUTHORS
------------------------------------------------------------------------------

Corpus Creation: Carlos Daniel Hernández Mena, Iván Vladimir Meza Ruiz
Final Edition: Carlos Daniel Hernández Mena

------------------------------------------------------------------------------
ACKNOWLEDGEMENTS
------------------------------------------------------------------------------

The authors would like to thank Alberto Templos Carbajal, Elena Vera and
Angélica Gutiérrez for their support of the social service program
"Desarrollo de Tecnologías del Habla" at the Facultad de Ingeniería (FI) of
the Universidad Nacional Autónoma de México (UNAM). We also thank the
social service students for all their hard work.

------------------------------------------------------------------------------
------------------------------------------------------------------------------
        To find corpora similar to this one, visit: www.ciempiess.org
------------------------------------------------------------------------------
------------------------------------------------------------------------------