This release of the Standard Arabic Morphological Analyzer (SAMA v3.1) provides updates relative to the previous release (3.0) on all components of the package: the core SAMA.pm module, the command-line scripts, and the files containing dictionary and morphotactic data. The installation procedure has also changed -- please see the file called INSTALL.

Because the initial SAMA release (3.0) was a significant departure from the earlier Buckwalter Arabic Morphological Analyzer (BAMA 2.0), the present release repeats the parts of the SAMA 3.0 documentation that described its differences relative to BAMA. Users who are migrating directly from BAMA 2.0 to SAMA 3.1 will want to review the files "Changes.as-of-3_0" and "table_updates.v3_0.txt" for more background information about the BAMA-to-SAMA transition.

For a more detailed summary of the dictionary/table changes that are being introduced for the first time in v3_1, please refer to the file "table_updates.v3_1.txt".

The remainder of this file reviews changes to the core SAMA.pm module and the various command-line scripts. Please consult the source code documentation for full details.

CHANGES IN SAMA.pm

* Updated user documentation

  The SAMA user manual (stored in the source code of SAMA.pm and viewable via the command "perldoc SAMA") has been significantly revised to keep it consistent with changes in the code, and to improve overall clarity and organization.

* New option for controlling tokenization

  There is now an optional "markup" parameter for controlling the behavior of the tokenizer on non-Arabic text; when the "markup" parameter is set, portions of input text that resemble HTML/SGML/XML tags and entity references will be output as single tokens, rather than having their angle brackets, slashes, etc. split off as separate tokens.

* New option for controlling the set of analyses returned

  There is now an optional "exclude" parameter, which may be used alone or in combination with the "match" parameter when calling "analyze" on a text string, so that analyses that match a given regex pattern will not be returned as possible solutions.

* Caching of solutions is turned on by default

  In SAMA 3.0, caching of solutions was not enabled by default, but in most typical uses of the module (analyzing whole Arabic documents in a single run), caching improves run-time performance by a factor of about 2 (i.e. the process takes about half the time with caching enabled), so as of 3.1, caching is turned on by default.

* Bug-fixes

  - A problem in the handling of punctuation marks that precede words (e.g. open-quotes, open-parens) has been fixed, and these are now tokenized according to the specs in the module documentation.

  - A problem in the behavior of the "cache" option has been fixed; now, cached solutions are stored and retrieved based on the original orthographic form of each input word, rather than on the normalized orthographic form.

  - When the "partials" option was set, rows returned for word fragments had fewer tab-delimited fields than rows for other types of tokens; this has been fixed so that the set of fields per row is consistent for all token types, including "partial" words.

CHANGES IN sama-analyze

* Option flags have been added to allow command-line control of the new "exclude" and "markup" options in SAMA.pm, and user documentation has been updated accordingly.

CHANGES IN sama-tbl2dbm

* Tests have been added to check for possible character errors in the Buckwalter Arabic fields of the dict* input tables; now, if any Arabic string in a dict entry contains one or more of "GJPRV" (Buckwalter codes for Arabic letters that are not used in MSA), or any non-Buckwalter character, the process exits with an error message and new DB_File tables are not built.

* Documentation and error messages have been updated.

CHANGES IN sama-enumerate

* Command-line usage has been reorganized to permit more detailed control over the amount and types of output created; most significantly, it is now possible to enumerate all word forms (with their analyses) for any chosen (combination of) lexical codes.

* Documentation has been updated to explain the new command-line syntax, and to provide more information about the available kinds of outputs.

* Bug-fixes

  - In the 3.0 release, sama-enumerate was not consistent with the other components in the release, and would not run as supplied (it had "use BAMA" where it should have had "use SAMA").

CHANGES IN sama-dbm2txt

* Bug-fix: amended so that DB files are explicitly set for read-only access; the previous version failed if the owner lacked write access on the DB file, even though the script would never change the DB file contents.
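
ILLUSTRATION: THE sama-tbl2dbm CHARACTER TEST

The character test described under "CHANGES IN sama-tbl2dbm" can be sketched roughly as follows. This is a Python illustration only, not the code used by the Perl script. The names BUCKWALTER_CHARS, NON_MSA_CODES, find_bad_chars and validate_table are hypothetical, and the set of valid symbols is an assumption based on the commonly published Buckwalter transliteration table; the authoritative set is whatever sama-tbl2dbm itself accepts.

```python
# Assumption: the commonly published Buckwalter transliteration symbols.
# The set actually accepted by sama-tbl2dbm is defined in its own source.
BUCKWALTER_CHARS = set("'|>&<}AbptvjHxd*rzs$SDTZEg_fqklmnhwYyFNKaui~o`{")

# From the release notes: Buckwalter codes for letters not used in MSA.
NON_MSA_CODES = set("GJPRV")


def find_bad_chars(arabic_field):
    """Return a sorted list of the characters that should cause rejection."""
    return sorted(
        ch for ch in set(arabic_field)
        if ch in NON_MSA_CODES or ch not in BUCKWALTER_CHARS
    )


def validate_table(fields):
    """Raise ValueError at the first bad entry, so no tables get built."""
    for n, field in enumerate(fields, start=1):
        bad = find_bad_chars(field)
        if bad:
            raise ValueError(
                f"entry {n}: invalid character(s) {''.join(bad)!r} in {field!r}"
            )
```

Under these assumptions, a caller would run validate_table over all Arabic fields before writing any output, so that a single bad entry stops the build, in the same spirit as the script exiting with an error message before new DB_File tables are created.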