Documenting Xi'an Guanzhong - Object Naming V1.0

Documenting Xi'an Guanzhong - Object Naming is a corpus of audio recordings of speakers of the Guanzhong dialect of Mandarin Chinese, living in or near Xi'an in Shaanxi Province, naming objects that appear in colored line drawings. The corpus was developed to support traditional and computer-aided language documentation.

Languages

Guanzhong dialect of Mandarin Chinese (cmn) as spoken in Xi'an and its environs.

Recommended/Expected use of corpus

The corpus was developed to support traditional and computer-aided language documentation research.

Collection Procedure

This collection used object naming to elicit speech in the target linguistic variety. Speakers of the target variety were presented with images selected from the 750-image MultiPic dataset (https://www.bcbl.eu/databases/multipic/) and asked to record themselves naming the objects in the images.

Before speech collection, an annotator (a native speaker of Mandarin, linguistics student and lifetime resident of China) reviewed each image to identify those that would be appropriate for a naming task involving other residents of China. The annotator marked each image as good or bad for the purpose but could also skip the item or report problems such as the image not fully loading. Where this first annotator was uncertain, images were reviewed by a more senior linguist among the corpus authors. From among the 750 MultiPic images, 622 were selected. Images were presented and judgments collected via LanguageARC.

Speech contributors were native speakers of the Guanzhong dialect and students of English translation, selected by some of the corpus authors who are themselves native speakers. The 622 images were then presented to ~59 contributors via LanguageARC, where they were asked to record themselves naming the objects in the images. Each naming took a few seconds to accomplish.
Contributors could proceed at their own pace, skip items, and leave the task and return as they wished. The task included a tutorial in standard Chinese on making high-quality audio recordings. The speech collection ran from February through May 2021.

The object naming task yielded 34,729 audio recordings. Quality varied according to network conditions, hardware availability and the contributors' environments and behaviors. For example, due to imperfections in the standard HTML5 libraries used to collect the audio, some recordings contained strings of NULLs and/or discontinuities in the speech signals. Others suffered from very low signal-to-noise ratios or were truncated, probably due to environmental noise and contributor behavior, respectively. Custom detectors were developed and deployed to identify problematic files, which have been removed from this release.

Native speakers of the Guanzhong dialect, a subset of the speech contributors, reviewed each item that was not excluded for the technical problems noted above. They indicated whether the recording was truncated or contained no speech; contained an artifact of the digitization problems noted above; or was too soft, suffered from excessive background noise or was in the wrong language/dialect. Files confirmed to have been truncated or marred by digital artifacts were removed from this release. The remaining 25,972 files were retained but are marked in the metadata if they were annotated for any of the other problems listed above.

Data Format Specific Details

59 userIDs uttered namings of the objects in up to 622 images each. As this data was collected via a web-based Citizen Science portal, albeit from a closed community of volunteers, there is a small chance that the same speaker was given multiple userIDs. Some speakers skipped some items and some named the same items multiple times.

The data is organized into 622 directories according to the image presented.
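As an illustration of working with this layout, the following minimal Python sketch tallies the FLAC files in each image directory and summarizes the counts. The corpus root path, the `.flac` file extension, and the function names `count_recordings` and `summarize` are assumptions made for illustration and are not part of the release. The use of the population standard deviation is likewise an assumption.

```python
from pathlib import Path
from statistics import mean, pstdev


def count_recordings(corpus_root):
    """Map each image directory name to its number of .flac files.

    Assumes one subdirectory per image directly under corpus_root.
    """
    root = Path(corpus_root)
    return {
        d.name: sum(
            1 for f in d.iterdir()
            if f.is_file() and f.suffix.lower() == ".flac"
        )
        for d in sorted(root.iterdir())
        if d.is_dir()
    }


def summarize(counts):
    """Summarize per-directory recording counts."""
    values = list(counts.values())
    return {
        "directories": len(values),
        "recordings": sum(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        # Population standard deviation; an assumption, since the
        # documentation does not say which estimator it reports.
        "std": pstdev(values),
    }
```

Calling `summarize(count_recordings("path/to/corpus"))`, with a hypothetical path to the unpacked release, returns a dictionary of directory and recording totals together with the mean, minimum, maximum and standard deviation of recordings per directory.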
Each directory contains on average 42 recordings of namings of the object in the image (min=7, max=54, std=4.6), stored as single-channel, 16-bit FLAC files sampled at 16 kHz.

References

For more information on MultiPic, see:
Duñabeitia, Jon Andoni, Davide Crepaldi, Antje S. Meyer, Boris New, Christos Pliatsikas, Eva Smolka, Marc Brysbaert. 2018. MultiPic: A standardized set of 750 drawings with norms for six European languages. Quarterly Journal of Experimental Psychology. 71:4. pp. 808-816.

For more information on LanguageARC, see:
Fiumara, James, Christopher Cieri, Jonathan Wright, Mark Liberman. 2020. LanguageARC: Developing Language Resources Through Citizen Linguistics. LREC 2020: 12th Edition of the Language Resources and Evaluation Conference. CLLRD Workshop: Citizen Linguistics in Language Resource Development. Marseille, May 11-16.