OVERVIEW OF TABLE DIFFERENCES:  Version 2.0 vs. Version 3.0

A. Entries per table (counting lines that do not begin with ";")

   table_name        v2      v3   unchanged
   -----------------------------------------
   dictPrefixes      548    1296      193
   dictSuffixes      906     945      894
   dictStems       78839   79216    76163
   tableAB          2435    2445     2429
   tableAC          1138    1152     1138
   tableBC          1612    1613     1608

B. Distinct lemmas in dictStems (counting lines beginning with ";; ")

      v2      v3   unchanged
   40219   40561    39998

C. Number of distinct POS tags

    v2    v3
   167   176

D. POS tags present in v2.0 but absent in v3.0

   EMPH_PART        -- replaced by EMPHATIC_PART
   EXCEPT_PART      -- replaced by RESTRIC_PART
   FUT              -- merged with FUT_PART
   JUS              -- replaced by JUS_PART
   LATIN            -- replaced by FOREIGN_SCRIPT
   NUM              -- replaced by ADJ_NUM and NOUN_NUM
   PVSUFF_SUBJ:2FD  -- merged with PVSUFF_SUBJ:2D
   PVSUFF_SUBJ:2MD  -- merged with PVSUFF_SUBJ:2D
   SUB              -- li/SUB merged with li/PREP
   VERB_PERFECT     -- merged with PV

E. POS tags introduced in v3.0

   ADJ_COMP        ADJ_NUM         CONNEC_PART     DEM_PRON
   DEM_PRON_D      DEM_PRON_P      EMPHATIC_PART   EXCLAM_PRON
   FOREIGN         FOREIGN_SCRIPT  INTERROG_ADV    INTERROG_PRON
   JUS_PART        NOUN_NUM        NOUN_QUANT      PSEUDO_VERB
   RESTRIC_PART    VERB            VOC_PART

F. Open-class stems built from dictStems (ADJ*,ADV,CV,IV,PV,NOUN*):

   pos           v2      v3
   -------------------------
   ADJ*        9935   10360
   ADV           67      50
   CV            55      96
   IV         13465   13479
   PV         17289   17343
   IV_PASS     2768    2812
   PV_PASS      309     381
   NOUN*      42573   42756
   -------------------------
   total:     86461   87277

   Explanation:

   ADJ* and NOUN* refer to all ADJ and NOUN tags, including subcategories such as ADJ_COMP, ADJ_NUM, NOUN_PROP, NOUN_QUANT, etc.

   These numbers are based on the Berkeley DB file "dictStems.db" built from the original dictStems text table (using sama-dbm2txt to extract the DB file contents). The values above represent the number of distinct combinations of lookup stem, diacritized stem, POS tag, gloss and lemma_id, looking only at stems having the POS tags listed.

   The decrease in ADV entries is due to changing the POS labels of some stems from ADV to something else.

G. Overall impact of table changes on inventory of analyzable forms

   76,626,346 analysis look-ups supported in v2.0
   80,095,844 analysis look-ups supported in v3.0

   Explanation:

   The analysis logic in SAMA is governed by the tableAB, tableAC and tableBC look-ups, which dictate the permissible prefix-stem, prefix-suffix and stem-suffix combinations, respectively, for entries in the three dict* files.

   The "sama-enumerate" utility, included with the v3.0 release, creates a report file called "combo-code.triples", which provides a means for estimating the approximate number of word forms analyzable with a given version of the tables. This is done by counting the number of dict* entries associated with each prefix-stem-suffix combination permitted by the table* look-ups, and multiplying out the combinations of prefix, stem and suffix entries that are possible.

   For example, when "sama-enumerate" is run on the v3.0 tables, the first line of "combo-code.triples" is:

      23 * 2840 * 1 = 65320 forms generated for: IVPref-AnA->a IV IVSuff-a

   What this means is:

   - 23 entries in dictPrefixes use the code IVPref-AnA->a
   - 2840 entries in dictStems use the code IV
   - 1 entry in dictSuffixes uses the code IVSuff-a
   - since these three codes can co-occur when analyzing a word form, 65320 forms result from permutations of those dictionary entries.

   By summing over the fourth numeric column in the "combo-code.triples" file, we get the total number of prefix-stem-suffix permutations that are covered by a given set of tables.

   This total is only approximate. On the one hand, it overestimates the number of distinct input (undiacritized) word forms covered, because many of the permutations will have identical undiacritized forms.
On the other hand, it underestimates the total number of distinct analyses produced, in part because the 6 case-ending analyses normally produced for each noun are not taken into account in the enumeration statistics (each noun stem counts only once in this tally, but produces 6 distinct case-ending analyses).
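
   The summation over the fourth numeric column described above can be sketched in Python as follows. This is only an illustration and is not part of the SAMA distribution: the function names are hypothetical, and the line layout is assumed from the single "combo-code.triples" line quoted above, so lines in any other layout are simply skipped.

```python
import re

# Assumed layout, inferred from the one example line quoted in this document:
#   23 * 2840 * 1 = 65320 forms generated for: IVPref-AnA->a IV IVSuff-a
TRIPLE_LINE = re.compile(r"^\s*(\d+)\s*\*\s*(\d+)\s*\*\s*(\d+)\s*=\s*(\d+)\b")


def sum_triples(lines):
    """Sum the fourth numeric column over all lines in the assumed layout."""
    total = 0
    for line in lines:
        match = TRIPLE_LINE.match(line)
        if match is None:
            continue  # blank line or a layout this sketch does not recognize
        n_prefix, n_stem, n_suffix, n_forms = map(int, match.groups())
        # The fourth column should equal the product of the first three.
        if n_prefix * n_stem * n_suffix != n_forms:
            raise ValueError(f"inconsistent line: {line.rstrip()!r}")
        total += n_forms
    return total


def sum_triples_file(path="combo-code.triples"):
    """Hypothetical convenience wrapper: sum the column for a report file."""
    with open(path, encoding="utf-8") as handle:
        return sum_triples(handle)


# The single line quoted in this document contributes 23 * 2840 * 1 forms.
sample = ["23 * 2840 * 1 = 65320 forms generated for: IVPref-AnA->a IV IVSuff-a\n"]
print(sum_triples(sample))  # 65320
```

   Run against a complete report file, sum_triples_file() would be expected to give the kind of total reported at the top of this section, provided every line follows the assumed layout.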