DoWLS-MAN
Database of Word-level Statistics (Mandarin)

Authors: Karl David Neergaard, Hongzhi Xu, Chu-Ren Huang

This database provides descriptive and statistical lexical characteristics for words and nonwords of Mandarin Chinese. It is designed for researchers concerned with the processing of isolated words.

The database is divided into 32 files, separated according to the segmentation schema (8 tonal and 8 non-tonal) used to calculate phonological similarity between words. Half of the files are for words, the other half for nonwords. Each file consists of invariant and variant lexical characteristics.

Tonal segmentation schemas: C_G_V_C_T, C_G_V_X_T, C_V_C_T, C_G_VX_T, C_GVX_T, CG_V_X_T, CG_VX_T, CGVX_T
Non-tonal segmentation schemas: C_G_V_C, C_G_V_X, C_V_C, C_G_VX, C_GVX, CG_V_X, CG_VX, CGVX

*** Segmentation is done by adding blank spaces between units ***

Examples for each segmentation schema, using the syllables xiao3 /ɕiaʊ3/ (小) and xian4 /ɕiɛn4/ (线):

Tonal schema      Sampa (xiao3)   Sampa (xian4)
C_G_V_C_T         X i aʊ 3        X i E n 4
C_G_V_X_T         X i a ʊ 3       X i E n 4
C_V_C_T           X iaʊ 3         X iE n 4
C_G_VX_T          X i aʊ 3        X i En 4
C_GVX_T           X iaʊ 3         X iEn 4
CG_V_X_T          Xi a ʊ 3        Xi E n 4
CG_VX_T           Xi aʊ 3         Xi En 4
CGVX_T            Xiaʊ 3          XiEn 4

Non-tonal schema  Sampa (xiao3)   Sampa (xian4)
C_G_V_C           X i aʊ          X i E n
C_G_V_X           X i a ʊ         X i E n
C_V_C             X iaʊ           X iE n
C_G_VX            X i aʊ          X i En
C_GVX             X iaʊ           X iEn
CG_V_X            Xi a ʊ          Xi E n
CG_VX             Xi aʊ           Xi En
CGVX              Xiaʊ            XiEn

_________________________________
Invariant lexical characteristics

Key: Searchable phonological word, i.e., the unsegmented sampa (without spaces) of the phonological word. Segmented versions of the same sampa can be found for each of the schemas under Pho_T or Pho_NoT below.
PY_T: Pinyin with tone
PY_NoT: Pinyin without tone
IPA_T: IPA with tone
IPA_NoT: IPA without tone
Onset: Item onset according to the sampa dictionary below
Tone: Lexical tone number/s
SyStruct: Syllable structure according to the CGVX system
SyLen: Syllable length
PyLen: Pinyin length
Dom_POS: Dominant part of speech (POS)
Freq_Dom_POS: Frequency of the dominant POS
Percent_Dom_POS: The percentage of "Key_T" that the Dom_POS represents, in comparison to the other POS assignments for the same "Key_T"
Other_POSes: The other part-of-speech assignments, i.e., those that were not of the highest percentage

_______________________________
Variant lexical characteristics

Pho_T: Tonal segmented phonological word, i.e., the phonological word as represented in sampa, segmented according to a segmentation schema. The unsegmented version of this column is Key_T.
Pho_NoT: Non-tonal segmented phonological word, i.e., the phonological word as represented in sampa, segmented according to a segmentation schema. The unsegmented version of this column is Key_NoT.
PhoLen: Phoneme length, i.e., the number of units within the item after being segmented according to one of the 16 segmentation schemas
FreqPM: Subtitle movie corpus lexical frequency (per million), i.e., the summed subtitle movie corpus frequency per million for each phonological word seen in Key
Homophones: Simplified Chinese character/s that correspond to the phonological word presented in Key
HD: Homophone density, i.e., the number of words that share the same pronunciation as seen in Pho_T for tonal files and Pho_NoT for non-tonal files
PND: Phonological neighborhood density, i.e., the total number of phonological neighbors found by substituting, adding, or deleting one segment between the target word and the top 30,000 most frequent phonological words
Sub_PND: Substitution neighbors, i.e., the number of substitution neighbors between the target word and the top 30,000 most frequent phonological words
Add_PND: Addition neighbors, i.e., the number of addition neighbors between the target word and the top 30,000 most frequent phonological words
Del_PND: Deletion neighbors, i.e., the number of deletion neighbors between the target word and the top 30,000 most frequent phonological words
Neighbors: All phonological neighbors, presented in sampa
NF: Neighborhood frequency, i.e., the mean frequency of all of the phonological neighbors seen in Neighbors
CC: Clustering coefficient, i.e., the proportion of a word's neighbors that are also neighbors of each other, measured as local transitivity with the igraph R function transitivity()

_______________________________
Notes on the word list

The original frequency counts are adapted from the word list of Subtlex-CH (Cai & Brysbaert, 2010). Monosyllables from the Subtlex-CH character list that were not present as monosyllabic words were added to the list in order to provide statistical information for all Mandarin syllables. They were, however, given a frequency count of 1 and accordingly do not contribute to measures of phonological neighborhoods.
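
The neighbor measures described under the variant lexical characteristics (PND, Sub_PND, Add_PND, Del_PND) can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that a neighbor is a word whose space-segmented form differs from the target by exactly one substituted, added, or deleted unit, that "addition" means the neighbor has one unit more than the target, and that identical forms (homophones) are not counted as neighbors. The function names and the toy lexicon are hypothetical, and the restriction to the top 30,000 most frequent phonological words is not implemented.

```python
from collections import Counter


def neighbor_type(target, other):
    """Classify `other` relative to `target` (both space-segmented strings).

    Returns 'sub', 'add', 'del', or None if the two are not neighbors.
    Assumption: 'add' means `other` equals `target` plus one unit.
    """
    t, o = target.split(), other.split()
    if t == o:
        return None  # identical forms are homophones, not neighbors
    if len(t) == len(o):
        # Substitution: exactly one position differs.
        diffs = sum(a != b for a, b in zip(t, o))
        return "sub" if diffs == 1 else None
    if len(o) == len(t) + 1:
        # Addition: removing one unit from `other` yields `target`.
        if any(o[:i] + o[i + 1:] == t for i in range(len(o))):
            return "add"
        return None
    if len(o) == len(t) - 1:
        # Deletion: removing one unit from `target` yields `other`.
        if any(t[:i] + t[i + 1:] == o for i in range(len(t))):
            return "del"
    return None


def neighborhood(target, lexicon):
    """Count neighbors of `target` within `lexicon` by neighbor type."""
    counts = Counter()
    neighbors = []
    for word in lexicon:
        kind = neighbor_type(target, word)
        if kind is not None:
            counts[kind] += 1
            neighbors.append(word)
    return {
        "PND": len(neighbors),
        "Sub_PND": counts["sub"],
        "Add_PND": counts["add"],
        "Del_PND": counts["del"],
        "Neighbors": neighbors,
    }


# Hypothetical toy lexicon, segmented in the style of the C_G_V_X_T schema.
lexicon = ["X i a 4", "X i a 3", "J i a 4", "X i a N 4", "i a 4", "X i E n 4"]
result = neighborhood("X i a 4", lexicon)
print(result)
```

Because the comparison runs over space-separated units, the same word can have different neighbors under different segmentation schemas, which is why these columns vary from file to file.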

_____________________________________________
IPA to sampa conversion chart with examples

The following conversions come from Neergaard & Huang (2019).

IPA -> Sampa -> Pinyin Syllable -> Sampa Syllable -> Character
a -> a -> ba3 -> pa3 -> 把
ə -> @ -> she4 -> S@4 -> 蛇
e -> e -> gei3 -> keI3 -> 给
ɛ -> E -> ye3 -> iE3 -> 也
ɨ -> 1 -> zhi1 -> Z11 -> 之
i -> i -> di4 -> ti4 -> 第
ɪ -> I -> sui4 -> sueI4 -> 岁
o -> o -> ruo4 -> ruo4 -> 若
ʊ -> U -> chou3 -> CoU3 -> 丑
u -> u -> wo3 -> uo3 -> 我
y -> y -> yuan2 -> yEn2 -> 元
m -> m -> ma1 -> ma1 -> 妈
n -> n -> neng2 -> n@N2 -> 能
ŋ -> N -> xiang3 -> XiaN3 -> 想
l -> l -> lie4 -> liE4 -> 列
r -> r -> rang4 -> raN4 -> 让
p -> p -> bu4 -> pu4 -> 不
pʰ -> P -> pao3 -> PaU3 -> 跑
k -> k -> ge0 -> k@0 -> 个
kʰ -> K -> ke4 -> K@4 -> 课
t -> t -> dou1 -> toU1 -> 都
tʰ -> T -> ta1 -> Ta1 -> 他
s -> s -> suo3 -> suo3 -> 所
f -> f -> fang4 -> faN4 -> 放
x -> x -> hui4 -> xueI4 -> 会
ʂ -> S -> shi4 -> S14 -> 是
ɕ -> X -> xia4 -> Xia4 -> 下
tɕ -> J -> jiu4 -> JioU4 -> 就
tɕʰ -> Q -> qing3 -> QiN3 -> 请
tsʰ -> c -> cong2 -> coN2 -> 从
tʂʰ -> C -> chu1 -> Cu1 -> 出
ts -> z -> zi4 -> z14 -> 字
tʂ -> Z -> zhe4 -> Z@4 -> 这

References:

Cai, Q. & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character frequencies based on film subtitles. PLoS ONE 5(6): e10729. http://dx.doi.org/10.1371/journal.pone.0010729

Neergaard, K. & Huang, C. (2019). Constructing the Mandarin Phonological Network: Novel Syllable Inventory Used to Identify Schematic Segmentation. Complexity, Article 6979830, 1-21. http://202.171.253.68/downloads.hindawi.com/journals/complexity/2019/6979830.pdf