#1-Introduction

The WSJ part of the DIRHA English Dataset [1] is a multi-microphone acoustic corpus developed under the EC project Distant-speech Interaction for Robust Home Applications (DIRHA). The corpus is composed of both real and simulated sequences recorded with 32 sample-synchronized microphones in a domestic environment. The database contains signals with different noise and reverberation characteristics, making it suitable for various multi-microphone signal processing and distant speech recognition tasks. The part of the dataset currently released comprises 6 native US speakers (3 males, 3 females) uttering 409 WSJ sentences. The database can be coupled with the related Kaldi baselines and tools, which can be downloaded here: https://github.com/SHINE-FBK/DIRHA_English_wsj

#2-Microphone Network

The reference scenario is a real apartment equipped with 32 microphones spread across the living-room (26 mics) and the kitchen (6 mics). The microphone network is composed of 2 circular arrays of 6 microphones each (located on the ceilings of the living-room and the kitchen), a linear array of 11 sensors (located in the living-room), and 9 microphones distributed on the living-room walls. All the channels are high-quality omni-directional microphones (Shure MX391/O), except for the microphones of the linear array, which are cheaper electret sensors. The channels are connected to multi-channel clocked pre-amp and A/D boards (RME Octamic II), which allow a high-quality, perfectly sample-synchronized acquisition. For each of the considered arrays (i.e., the two ceiling arrays and the linear one), a delay-and-sum beamformed signal (with ideal delays) has been generated and released along with the other channels.

#3-Acoustic Sequences

The database is composed of 75 real and 75 simulated sequences. All the sequences last 1 minute and are simultaneously observed by all the microphones of the network.
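The delay-and-sum beamforming mentioned in Section 2 combines the time-aligned microphone signals into a single enhanced channel. The following is a minimal illustrative sketch only (not the actual tool used to generate the released beamformed signals), assuming known integer sample delays:

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Delay-and-sum beamforming with ideal (known) integer sample delays.

    channels: list of equal-rate 1-D signals, one per microphone
    delays:   steering delay of each channel, in samples
    """
    n = min(len(x) for x in channels)
    out = np.zeros(n)
    for x, d in zip(channels, delays):
        # Advance each channel by its delay so all copies of the source
        # align in time (np.roll wraps at the edges, which is acceptable
        # only for this toy illustration).
        out += np.roll(np.asarray(x[:n], dtype=float), -int(d))
    return out / len(channels)  # average the aligned channels
```

After alignment, the coherent speech components add constructively while uncorrelated noise components average down, which is the basic rationale of this beamformer.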
Each sequence contains a variable number (ranging from 4 to 6) of WSJ sentences uttered in different positions of the living-room. The sequences also include typical domestic background noises as well as inter/intra-room reverberation effects. For the real part of the dataset, each subject was positioned in the living-room and read the material from a tablet, standing still or sitting on a chair, in a given position. The simulated sequences have been generated with the contamination approach described in [2], which is based on the combination of high-quality close-talking recordings, high-quality multi-microphone impulse responses (measured with the Exponential Sine Sweep method), and recorded noisy sequences. The original recordings were made at a sampling frequency of 48 kHz. However, for the sake of compactness, the released signals are in wav format with a 16 kHz sampling frequency and 16-bit resolution.

#4-Dataset Organization

The database is organized as follows:
- “DIRHA_English_wsj”: contains the 75 simulated and 75 real 1-minute acoustic sequences with the related xml annotations.
- “Additional_info”: contains information about channels and speakers.

The next sections provide more details on the data annotation and on the provided Kaldi baselines & tools.

#5-Data Annotation

The annotations of the acoustic sequences are released in xml format. For each channel of each acoustic sequence, the xml annotation file reports:
- the name and the coordinates of the microphone (in cm), under the tag “”;
- the source id, the speaker id, the speaker gender, the speaker position, the initial and final sample where the source is active, as well as the text uttered by the speaker, under the tag “”.

Moreover, the folder “Additional_info” contains other potentially useful information, such as a file listing all the considered channels, a file reporting all the possible speaker positions, and a file with additional speaker information (such as gender, age, and height).
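An annotation file of this kind can be consumed with any standard xml parser. The sketch below uses Python's `xml.etree`; note that all element and attribute names in it are made-up placeholders for illustration, since the real tag names are the ones defined in the released annotation files:

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation snippet: element and attribute names ("mic",
# "source", "begin_sample", ...) are placeholders, NOT the real schema.
SAMPLE_XML = """
<annotation>
  <mic name="LA6" x="120" y="245" z="195"/>
  <source id="s1" speaker="FE1" gender="F" position="B2"
          begin_sample="16000" end_sample="112000">
    A WSJ SENTENCE UTTERED BY THE SPEAKER
  </source>
</annotation>
"""

root = ET.fromstring(SAMPLE_XML)

# Microphone name and coordinates (in cm).
mic = root.find("mic")
print("mic:", mic.get("name"), mic.get("x"), mic.get("y"), mic.get("z"))

# Source entries: speaker metadata, active sample range, uttered text.
for src in root.findall("source"):
    begin = int(src.get("begin_sample"))
    end = int(src.get("end_sample"))
    print("source:", src.get("speaker"), begin, end, src.text.strip())
```

With the real tag names substituted in, the same pattern is enough to recover, e.g., the active sample ranges needed to cut single utterances out of a 1-minute sequence.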
The “Additional_info” folder also contains a floorplan and some photos of the considered apartment, completing the provided documentation.

#6-Kaldi Baselines & Tools

The Kaldi baselines released with the DIRHA English dataset can be downloaded at the link below. At the same link, one can find the tools to perform the data contamination of the original WSJ dataset and to extract the DIRHA WSJ sentences from the 1-minute sequences with an oracle VAD.

https://github.com/SHINE-FBK/DIRHA_English_wsj

#7-References

Please cite the following paper if you use the DIRHA English WSJ dataset:

[1] M. Ravanelli, L. Cristoforetti, R. Gretter, M. Pellin, A. Sosi, M. Omologo, "The DIRHA-English corpus and related tasks for distant-speech recognition in domestic environments", in Proceedings of ASRU 2015.

[2] M. Ravanelli, P. Svaizer, M. Omologo, "Realistic Multi-Microphone Data Simulation for Distant Speech Recognition", in Proceedings of Interspeech 2016.