OVERVIEW OF TABLE DIFFERENCES: Version 2.0 vs. Version 3.0

A. Entries per table (counting lines that do not begin with ";")

table_name	  v2	  v3	unchanged
-----------------------------------------
dictPrefixes	  548	 1296	  193
dictSuffixes	  906	  945	  894
dictStems	78839	79216	76163
tableAB		 2435	 2445	 2429
tableAC		 1138	 1152	 1138
tableBC		 1612	 1613	 1608
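
The counting rule in section A can be sketched in Python as follows.
This is only an illustration: the helper names are hypothetical, blank
lines are assumed not to count as entries, and "unchanged" is assumed
to mean entry lines that appear verbatim in both versions (the notes
above do not define it).  The sample lines are placeholders, not real
table content.

```python
def entry_lines(lines):
    """Keep entry lines: non-blank lines that do not begin with ';'."""
    return [ln.rstrip("\n") for ln in lines
            if ln.strip() and not ln.startswith(";")]


def compare_tables(v2_lines, v3_lines):
    """Return (v2 count, v3 count, count of entries identical in both)."""
    v2 = entry_lines(v2_lines)
    v3 = entry_lines(v3_lines)
    # Assumption: an entry is "unchanged" if the same line occurs in both.
    unchanged = len(set(v2) & set(v3))
    return len(v2), len(v3), unchanged


# Placeholder lines, invented for illustration only.
v2_sample = ["; comment", "entry-one", "entry-two"]
v3_sample = ["; comment", "entry-one", "entry-two-revised", "entry-three"]

print(compare_tables(v2_sample, v3_sample))  # (2, 3, 1)
```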


B. Distinct lemmas in dictStems (counting lines beginning with ";; ")

v2	v3	unchanged
40219	40561	39998
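
A comparable sketch for the lemma count in section B, again with a
hypothetical helper name.  It assumes that the text following ";; " on
a line identifies the lemma, and that "unchanged" means lemmas present
in both versions; neither detail is spelled out above.  The sample
lines are placeholders.

```python
def distinct_lemmas(lines):
    """Collect the distinct lemma labels from lines beginning with ';; '."""
    return {ln[3:].strip() for ln in lines if ln.startswith(";; ")}


# Placeholder lines, invented for illustration only.
v2_sample = [";; lemma_1", "stem-entry", ";; lemma_2", ";; lemma_1"]
v3_sample = [";; lemma_2", "stem-entry", ";; lemma_3"]

v2_lemmas = distinct_lemmas(v2_sample)
v3_lemmas = distinct_lemmas(v3_sample)

# Assumption: "unchanged" lemmas are those found in both versions.
print(len(v2_lemmas), len(v3_lemmas), len(v2_lemmas & v3_lemmas))  # 2 2 1
```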


C. Number of distinct POS tags

v2	v3
167	176


D. POS tags present in v2.0 but absent in v3.0

EMPH_PART        --  replaced by EMPHATIC_PART
EXCEPT_PART      --  replaced by RESTRIC_PART
FUT              --  merged with FUT_PART
JUS              --  replaced by JUS_PART
LATIN            --  replaced by FOREIGN_SCRIPT
NUM              --  replaced by ADJ_NUM and NOUN_NUM
PVSUFF_SUBJ:2FD  --  merged with PVSUFF_SUBJ:2D
PVSUFF_SUBJ:2MD  --  merged with PVSUFF_SUBJ:2D
SUB              --  li/SUB merged with li/PREP
VERB_PERFECT     --  merged with PV


E. POS tags introduced in v3.0

ADJ_COMP
ADJ_NUM
CONNEC_PART
DEM_PRON
DEM_PRON_D
DEM_PRON_P
EMPHATIC_PART
EXCLAM_PRON
FOREIGN
FOREIGN_SCRIPT
INTERROG_ADV
INTERROG_PRON
JUS_PART
NOUN_NUM
NOUN_QUANT
PSEUDO_VERB
RESTRIC_PART
VERB
VOC_PART
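
As a consistency check, the tag counts in section C follow from the
lists in sections D and E: 167 tags in v2.0, minus the 10 dropped,
plus the 19 introduced, gives 176.  A short Python sketch of that
arithmetic, using only the tag names listed above:

```python
# Tags present in v2.0 but absent in v3.0 (section D).
removed = [
    "EMPH_PART", "EXCEPT_PART", "FUT", "JUS", "LATIN", "NUM",
    "PVSUFF_SUBJ:2FD", "PVSUFF_SUBJ:2MD", "SUB", "VERB_PERFECT",
]

# Tags introduced in v3.0 (section E).
added = [
    "ADJ_COMP", "ADJ_NUM", "CONNEC_PART", "DEM_PRON", "DEM_PRON_D",
    "DEM_PRON_P", "EMPHATIC_PART", "EXCLAM_PRON", "FOREIGN",
    "FOREIGN_SCRIPT", "INTERROG_ADV", "INTERROG_PRON", "JUS_PART",
    "NOUN_NUM", "NOUN_QUANT", "PSEUDO_VERB", "RESTRIC_PART", "VERB",
    "VOC_PART",
]

v2_tag_count = 167  # from section C
expected_v3 = v2_tag_count - len(removed) + len(added)
print(len(removed), len(added), expected_v3)  # 10 19 176
```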


F. Open-class stems built from dictStems (ADJ*,ADV,CV,IV,PV,IV_PASS,PV_PASS,NOUN*):

 pos	  v2	  v3
---------------------
ADJ*	 9935	10360
ADV	   67	   50
CV	   55	   96
IV	13465	13479
PV	17289	17343
IV_PASS	 2768	 2812
PV_PASS	  309	  381
NOUN*	42573	42756
---------------------
total:	86461	87277
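
The totals can be re-derived from the rows of the table; a short
Python check using only the figures shown above:

```python
# (v2, v3) counts per POS class, copied from the table in section F.
rows = {
    "ADJ*":    (9935, 10360),
    "ADV":     (67, 50),
    "CV":      (55, 96),
    "IV":      (13465, 13479),
    "PV":      (17289, 17343),
    "IV_PASS": (2768, 2812),
    "PV_PASS": (309, 381),
    "NOUN*":   (42573, 42756),
}

v2_total = sum(v2 for v2, _ in rows.values())
v3_total = sum(v3 for _, v3 in rows.values())
print(v2_total, v3_total)  # 86461 87277
```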

Explanation:

ADJ* and NOUN* refer to all ADJ and NOUN tags, including subcategories
such as ADJ_COMP, ADJ_NUM, NOUN_PROP, NOUN_QUANT, etc.

These numbers are based on the Berkeley DB file "dictStems.db" built
from the original dictStems text table (using sama-dbm2txt to extract
the DB file contents).  The values above represent the number of
distinct combinations of lookup stem, diacritized stem, POS tag, gloss
and lemma_id, looking only at stems having the POS tags listed.  The
decrease in ADV entries is due to changing the POS labels of some
stems from ADV to something else.
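
The counting described above can be sketched roughly as follows.  The
output format of sama-dbm2txt is not described here, so this sketch
assumes the records have already been parsed into tuples of (lookup
stem, diacritized stem, POS tag, gloss, lemma_id).  The function names
are hypothetical, the sample records are placeholders, and matching
ADJ* and NOUN* by tag prefix is an assumption about how the
subcategories were grouped.

```python
from collections import Counter

# Tags counted under their own name; ADJ* and NOUN* are matched by prefix.
EXACT_TAGS = {"ADV", "CV", "IV", "PV", "IV_PASS", "PV_PASS"}


def pos_class(tag):
    """Map a POS tag to its row in the table, or None if not open-class."""
    if tag in EXACT_TAGS:
        return tag
    if tag.startswith("ADJ"):
        return "ADJ*"
    if tag.startswith("NOUN"):
        return "NOUN*"
    return None


def count_open_class(records):
    """Count distinct (lookup, diacritized, pos, gloss, lemma_id) tuples."""
    counts = Counter()
    for rec in set(records):          # set() drops exact duplicates
        cls = pos_class(rec[2])
        if cls is not None:
            counts[cls] += 1
    return counts


# Placeholder records, invented for illustration only.
records = [
    ("s1", "S1a", "PV", "gloss1", "lemma_1"),
    ("s1", "S1a", "PV", "gloss1", "lemma_1"),    # duplicate, counted once
    ("s1", "S1b", "IV", "gloss1", "lemma_1"),
    ("s2", "S2a", "NOUN", "gloss2", "lemma_2"),
    ("s3", "S3a", "NOUN_PROP", "gloss3", "lemma_3"),
    ("s4", "S4a", "ADJ_COMP", "gloss4", "lemma_4"),
    ("s5", "S5a", "PREP", "gloss5", "lemma_5"),  # not open-class, ignored
]

counts = count_open_class(records)
print(sorted(counts.items()))
# [('ADJ*', 1), ('IV', 1), ('NOUN*', 2), ('PV', 1)]
```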


G. Overall impact of table changes on inventory of analyzable forms

76,626,346 analysis look-ups supported in v2.0
80,095,844 analysis look-ups supported in v3.0

Explanation:

The analysis logic in SAMA is governed by the tableAB, tableAC and
tableBC look-ups, which dictate the permissible prefix-stem,
prefix-suffix and stem-suffix combinations, respectively, for entries
in the three dict* files.

The "sama-enumerate" utility, included with the v3.0 release, creates
a report file called "combo-code.triples", which provides a means of
estimating the number of word forms analyzable with a given version of
the tables.  This is done by counting the number of dict* entries
associated with each prefix-stem-suffix combination permitted by the
table* look-ups, and multiplying together the prefix, stem and suffix
entry counts for each combination.

For example, when "sama-enumerate" is run on the v3.0 tables, the
first line of "combo-code.triples" is:

 23 * 2840 * 1 = 65320 forms generated for: IVPref-AnA->a IV IVSuff-a

What this means is:

 - 23 entries in dictPrefixes use the code IVPref-AnA->a
 - 2840 entries in dictStems use the code IV
 - 1 entry in dictSuffixes uses the code IVSuff-a
 - since these three codes can co-occur when analyzing a word form,
   65320 forms result from permutations of those dictionary entries.

By summing over the fourth numeric column in the "combo-code.triples"
file, we get the total number of prefix-stem-suffix permutations that
are covered by a given set of tables.
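
The summation can be sketched in Python as shown below.  This assumes
that every line of "combo-code.triples" follows the same layout as the
example line quoted above; only that one line is taken from the real
report, and the second sample line is invented.  The function names
are hypothetical and are not part of the SAMA distribution.

```python
import re

# Layout assumed from the single example line quoted in this section.
LINE_RE = re.compile(
    r"^\s*(\d+) \* (\d+) \* (\d+) = (\d+) forms generated for: "
    r"(\S+) (\S+) (\S+)\s*$"
)


def parse_triple(line):
    """Return (n_prefix, n_stem, n_suffix, n_forms, codes) or None."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    n_pre, n_stem, n_suf, n_forms = (int(g) for g in m.groups()[:4])
    return n_pre, n_stem, n_suf, n_forms, m.groups()[4:]


def total_forms(lines):
    """Sum the fourth numeric column, checking each product on the way."""
    grand_total = 0
    for line in lines:
        parsed = parse_triple(line)
        if parsed is None:
            continue  # skip lines that do not match the assumed layout
        n_pre, n_stem, n_suf, n_forms, _codes = parsed
        if n_pre * n_stem * n_suf != n_forms:
            raise ValueError("inconsistent product in line: " + line)
        grand_total += n_forms
    return grand_total


sample = [
    " 23 * 2840 * 1 = 65320 forms generated for: IVPref-AnA->a IV IVSuff-a",
    " 2 * 3 * 4 = 24 forms generated for: PrefX StemY SuffZ",  # invented
]
print(total_forms(sample))  # 65344
```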

This total is only approximate.  On the one hand, it overestimates the
number of distinct input (undiacritized) word forms covered, because
many of the permutations will have identical undiacritized forms.  On
the other hand, it underestimates the total number of distinct
analyses produced, in part because the 6 case-ending analyses normally
produced for each noun are not taken into account in the enumeration
statistics (each noun stem counts only once in this tally, but
produces 6 distinct case-ending analyses).