README File for the GIGAWORD ARABIC TEXT CORPUS
===============================================


INTRODUCTION
------------

The Gigaword Arabic Corpus is a comprehensive archive of newswire
text data that has been acquired from Arabic news sources by the
Linguistic Data Consortium (LDC), at the University of Pennsylvania.

Four distinct sources of Arabic newswire are represented here:

 - Agence France Presse  (afa)
 - Al Hayat News Agency  (alh)
 - An Nahar News Agency  (ann)
 - Xinhua News Agency    (xia)

The three-character abbreviations shown above represent both the
directory names where the data files are found, and the 3-letter
prefix that appears at the beginning of every file name.

These news services all use Modern Standard Arabic (MSA), so there
should be a fairly limited scope for orthographic and lexical
variation due to regional Arabic dialects.  However, to the extent
that regional dialects might have an influence on MSA usage, it should
be noted that An Nahar is based in Beirut, Lebanon, and it may be safe
to assume that its material is created predominantly by speakers of
Levantine Arabic.  Al Hayat was originally a Lebanese news service as
well, but it has been based in London during the entire period
represented in this archive (and its owners are in Saudi Arabia, so it
is sometimes referred to as a Saudi news service); even so, much of
its reporting/editorial staff may be of Levantine origins.  The Xinhua
and AFP services are obviously international in scope (Xinhua is based
in Beijing, AFP in Paris), and we have no information about the
regional distribution of Arabic reporters and editors for these
services.

Much of the AFP content in this collection has been published
previously by the LDC in "Arabic Newswire Part 1" (LDC2001T55), and
some of this content has also been included in an Arabic supplement to
TDT3 and as the Arabic component of TDT4.  TDT4 also included a four
month sample from Al Hayat and An Nahar (October 2000 - January 2001).
Apart from that, all of the Al Hayat, An Nahar and Xinhua Arabic
content, as well as AFP content for 2001-2002, is being released here
for the first time.

Researchers who have already used the AFP content from LDC2001T55
should note that this material has been prepared differently for
inclusion in Gigaword Arabic: apart from using a simpler SGML markup
scheme and UTF8 character encoding, the Gigaword files present all
digit strings in "logical ordering" (most significant digit appears
first in a digit string), whereas the older CD-ROM release in 2001 had
digit strings in the original "right-to-left display ordering", as
delivered over the AFP newswire (least significant digit appeared
first).


CHARACTER ENCODING
------------------

The original data archives received by the LDC used three different
character encodings for Arabic: An Nahar provided their archives in
MacArabic, Xinhua and Al Hayat used CP1256, and AFP used a 7-bit
encoding called ASMO 449.  (In the earlier release of AFP Arabic data,
this was converted to ISO 8859-6, and that encoding served as the
source form for preparing the Gigaword release.)  To avoid the
problems and confusion that could result from differences in
character-set specifications, all text files in this corpus have been
converted to UTF-8 character encoding.
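For readers who hold other data in one of the legacy encodings named
above, the conversion step can be sketched in Python as follows.  This
is a minimal illustration, not the procedure the LDC used: the function
and table names are hypothetical, the codec names are those of Python's
standard library, and the 7-bit ASMO encoding is left out of the sketch.

```python
# Python standard-library codec names for three of the encodings
# mentioned above (the dictionary keys are just labels for this sketch).
LEGACY_CODECS = {
    "MacArabic": "mac_arabic",
    "CP1256": "cp1256",
    "ISO 8859-6": "iso8859_6",
}

def to_utf8(raw_bytes, source_encoding):
    """Decode bytes from a legacy Arabic encoding and re-encode as UTF-8."""
    codec = LEGACY_CODECS[source_encoding]
    return raw_bytes.decode(codec).encode("utf-8")

# Illustration: the letter meem (U+0645) is one byte in CP1256
# and two bytes in UTF-8.
meem_cp1256 = "\u0645".encode("cp1256")
meem_utf8 = to_utf8(meem_cp1256, "CP1256")
print(len(meem_cp1256), len(meem_utf8))
```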

Owing to the use of UTF-8, the SGML tagging within each file
(described in detail in the next section) shows up as lines of
single-byte-per-character (ASCII) text, whereas lines of actual text
data, including article headlines and datelines, contain a mixture of
single-byte and multi-byte characters.  In general, single-byte
characters in the text data will consist of digits and punctuation
marks (where the original source relied on ASCII punctuation codes,
rather than Arabic-specific punctuation), whereas multi-byte
characters consist of Arabic letters and a small number of special
punctuation or other symbols.  This variable-width character encoding
is intrinsic to UTF-8, and all UTF-8 capable processes will handle the
data appropriately.
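As a small illustration of the variable-width property, the following
Python sketch reports how many bytes UTF-8 uses for each character of a
short mixed string.  The sample string is made up for this example and
is not taken from the corpus.

```python
# Illustrative sample: two ASCII digits and a space, followed by the
# Arabic letters alef (U+0627), lam (U+0644) and meem (U+0645).
sample = "12 \u0627\u0644\u0645"

def utf8_widths(text):
    """Return (character, byte count) pairs for the UTF-8 encoding of text."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]

for ch, width in utf8_widths(sample):
    print(repr(ch), width)
```

The digits and the space each occupy one byte, while each Arabic letter
occupies two.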

The MacArabic encoding was designed to support ASCII digit characters
as well as the so-called Arabic-Indic digits, which have distinct
glyphs but are semantically equivalent to ASCII digits; CP1256 and
ASMO/ISO provide ASCII digits only.  On inspecting the An Nahar text
data, we found that both ASCII and Arabic-Indic digits were used, but
there seemed to be no rule or pattern to predict which set would be
used in a given instance.  In addition, because of the character
rendering assumptions that underlie MacArabic encoding, strings of
Arabic-Indic digits are presented in text files using "right-to-left
display order" while ASCII digit strings use logical order.

Readers of Arabic always read digit strings in a manner equivalent to
readers of English and other left-to-right languages -- i.e. the most
significant digit is always displayed left-most in the string --
regardless of the glyphs being used for the digits.  In terms of
ordering digit characters in a data stream, "logical order" refers to
having the most significant digit presented first in the stream.  In
English and other left-to-right languages, "logical order" is
identical to "display order", but for Arabic, "logical order" is the
reverse of "right-to-left display order".

To minimize confusion and useless variability in the Gigaword text
files, we have converted all Arabic-Indic digits in An Nahar data to
their ASCII equivalents, and when these occurred in strings of 2 or
more digits, we have reversed the strings so that they are presented
in logical order in each file, to be consistent with the conventions
used in the other sources.
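The digit treatment described above can be sketched in Python as
follows.  This is an illustration under stated assumptions, not the
LDC's actual conversion code: the function name is hypothetical, and
the sketch only handles contiguous runs of Arabic-Indic digits (how
numbers with embedded punctuation were handled is not described here).

```python
import re

# Arabic-Indic digits U+0660..U+0669, in order zero through nine.
ARABIC_INDIC = "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669"
TO_ASCII = str.maketrans(ARABIC_INDIC, "0123456789")
DIGIT_RUN = re.compile("[" + ARABIC_INDIC + "]+")

def normalize_digits(line):
    """Convert runs of Arabic-Indic digits to ASCII digits in logical order."""
    def fix(match):
        # The run is assumed to be stored in right-to-left display order,
        # so it is reversed after conversion.  Reversing a one-digit run
        # changes nothing, matching the "2 or more digits" rule.
        return match.group(0).translate(TO_ASCII)[::-1]
    return DIGIT_RUN.sub(fix, line)

# "3002" in Arabic-Indic digits (display order) becomes "2003";
# the ASCII run "1999" is already in logical order and is left alone.
print(normalize_digits("\u0663\u0660\u0660\u0662 1999"))
```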

As noted in the introduction above, the original AFP source data
always used right-to-left display order for digit strings -- this is
because the service assumes the data are being supplied mainly to
printing devices that operate in a strict, linear right-to-left
fashion.  All digit strings in the AFP files have been reversed in the
Gigaword release to yield logical ordering.


DATA FORMAT AND SGML MARKUP
---------------------------

Each data file name consists of the 3-letter prefix, followed by a
6-digit date (representing the year and month during which the file
contents were generated by the respective news source), followed by a
".gz" file extension, indicating that the file contents have been
compressed using the GNU "gzip" compression utility (RFC 1952).  So,
each file contains all the usable data received by LDC for the given
month from the given news source.
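A file name following this scheme can be taken apart as in the Python
sketch below.  The sketch assumes the 6-digit date is ordered as year
then month (YYYYMM); the function name and the sample file name are
hypothetical.

```python
import re

# Three-letter source prefix, 4-digit year, 2-digit month, ".gz".
FILE_NAME = re.compile(r"^(afa|alh|ann|xia)(\d{4})(\d{2})\.gz$")

def parse_file_name(name):
    """Return (prefix, year, month) for a corpus data file name."""
    match = FILE_NAME.match(name)
    if match is None:
        raise ValueError("not a corpus data file name: " + name)
    prefix, year, month = match.group(1), int(match.group(2)), int(match.group(3))
    if not 1 <= month <= 12:
        raise ValueError("month out of range in: " + name)
    return prefix, year, month

print(parse_file_name("xia200101.gz"))
```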

All text data are presented in SGML form, using a very simple, minimal
markup structure.  The file "gigaword_a.dtd" in the "docs" directory
provides the formal "Document Type Declaration" for parsing the SGML
content.  The corpus has been fully validated by a standard SGML
parser utility (nsgmls), using this DTD file.

The markup structure, common to all data files, can be summarized as
follows:

<DOC id="..." type="..." >
<HEADLINE>
The Headline Element is Optional -- not all DOCs have one
</HEADLINE>
<DATELINE>
The Dateline Element is Optional -- not all DOCs have one
</DATELINE>
<TEXT>
<P>
Paragraph tags are only used if the 'type' attribute of the DOC
happens to be "story" -- more on the 'type' attribute below...
</P>
<P>
Note that all data files use the UNIX-standard "\n" form of line
termination, and text lines are generally wrapped to a width of 80
characters or less.
</P>
</TEXT>
</DOC>

For every "opening" tag (DOC, HEADLINE, DATELINE, TEXT, P), there is a
corresponding "closing" tag -- always.  The attribute values in the
DOC tag are always presented within double-quotes; the "id=" attribute
of DOC consists of the 3-letter source abbreviation (in CAPS), an
8-digit date string representing the date of the story (YYYYMMDD), a
period, and a 4-digit sequence number starting at "0001" for each date
(e.g. "XIA20010101.0001"); in this way, every DOC in the corpus is
uniquely identifiable by the id string.
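The id layout described above (3-letter source in capitals, 8-digit
YYYYMMDD date, a period, 4-digit sequence number) can be parsed as in
this Python sketch.  The function name and the sample id are
hypothetical, chosen only to fit the stated layout.

```python
import re
from datetime import date

# SRC + YYYYMMDD + "." + NNNN, as described in the text.
DOC_ID = re.compile(r"^([A-Z]{3})(\d{4})(\d{2})(\d{2})\.(\d{4})$")

def parse_doc_id(doc_id):
    """Return (source, date, sequence number) for a DOC id string."""
    match = DOC_ID.match(doc_id)
    if match is None:
        raise ValueError("unexpected DOC id: " + doc_id)
    source, year, month, day, seq = match.groups()
    # date() raises ValueError for impossible dates such as month 13.
    return source, date(int(year), int(month), int(day)), int(seq)

print(parse_doc_id("ANN20010315.0007"))
```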

Every SGML tag is presented alone on one line, separate from other
tags, and from the text content (so a simple process like the UNIX
"grep -v '<'" will eliminate all tags, and retain all the text
content).
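For readers working in Python rather than at the shell, the same
filter can be sketched as follows.  The function name and sample lines
are hypothetical; the logic relies on the property stated above, that
every tag sits alone on its own line.

```python
def text_lines(lines):
    """Yield only the lines that contain no '<', like grep -v '<'."""
    for line in lines:
        if "<" not in line:
            yield line

sample = [
    "<TEXT>\n",
    "<P>\n",
    "first line of text\n",
    "second line, with an escaped bracket: a &lt; b\n",
    "</P>\n",
    "</TEXT>\n",
]
print("".join(text_lines(sample)), end="")
```

Literal "<" characters in the text are stored as entity references, so
they do not cause text lines to be dropped.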

The structure shown above represents some notable differences relative
to the markup strategy employed in previous LDC text corpora; these
are intended to facilitate bulk processing of the present corpus.  The
major differences are:

 - Earlier corpora usually organized the data as one file per day, or
   limited the average file size to one megabyte (MB).

Typical compressed file sizes in the current corpus range from about 2
MB to about 10 MB; this equates to a range of about 4.5 to 27 MB when
the data are uncompressed.  In
general, these files are not intended for use with interactive text
editors or word processing software (though many such programs are
likely to work reasonably well with these files).  Rather, it's
expected that the files will be used as input to programs that are
geared to dealing with data in such quantities, for filtering,
conditioning, indexing, statistical summary, etc.  (The LDC can
provide open source software, mostly written in Perl, for extracting
DOCs from such data files, using the "id" string or other search
criteria for story selection; see http://www.ldc.upenn.edu/Using/ .)
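As one possible approach to this kind of bulk processing, the Python
sketch below streams DOC units out of a compressed data file without
loading the whole file into memory.  It is not the LDC's Perl software;
the function names are hypothetical, and the sketch assumes the
one-tag-per-line layout described above.

```python
import gzip
import re

# Opening DOC tag with its two double-quoted attributes.
DOC_OPEN = re.compile(r'^<DOC id="([^"]+)" type="([^"]+)"\s*>')

def iter_docs(lines):
    """Yield (doc_id, doc_type, doc_lines) for each DOC in a line stream."""
    doc_id = doc_type = None
    buffer = []
    for line in lines:
        line = line.rstrip("\n")
        match = DOC_OPEN.match(line)
        if match:
            doc_id, doc_type = match.groups()
            buffer = [line]
        elif doc_id is not None:
            buffer.append(line)
            if line == "</DOC>":
                yield doc_id, doc_type, buffer
                doc_id = doc_type = None

def iter_docs_in_file(path):
    """Stream DOCs from one gzip-compressed, UTF-8 encoded data file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from iter_docs(handle)
```

Selecting stories by id or type then reduces to filtering the tuples
that iter_docs_in_file() yields.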

 - Earlier corpora tended to use different markup outlines (different
   tag sets) depending on the source of the data, because different
   sources came to us with different structural properties, and we had
   chosen to preserve these as much as possible (even though many
   elements of the delivered structure may have been meaningless for
   research use).

The present corpus uses only the information structure that is common
to all sources and serves a clear function: headline, dateline, and
core news content (usually containing paragraphs).  The "dateline" is
a brief string typically found at the beginning of the first paragraph
in each news story, giving the location the report is coming from, and
sometimes the news service and/or date; since this content is not part
of the initial sentence, we separate it from the first paragraph (this
was not done in previous corpora).

 - Earlier corpora tended to include "custom" SGML entity references,
   which were intended to preserve things like special punctuation or
   typesetting instructions (e.g. "&QL;", "&UR;", "&MD;", etc).

The present corpus uses only three SGML entity references:
 - ``&amp;'' represents the literal ampersand "&" character
 - ``&lt;''  represents the literal open-angle bracket "<"
 - ``&gt;''  represents the literal close-angle bracket ">"
All other specialized control characters have been filtered out.
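Restoring the literal characters from these three entity references
can be sketched in Python as follows.  The function name is
hypothetical; the order of replacement is the one detail that matters.

```python
def decode_entities(text):
    """Replace the three entity references with their literal characters."""
    # "&amp;" is handled last, so that an escaped ampersand followed by
    # "lt;" (i.e. "&amp;lt;") yields the text "&lt;" rather than "<".
    return (text.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&amp;", "&"))

print(decode_entities("AT&amp;T &lt;x&gt;"))
```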

 - In earlier corpora, newswire data were presented as streams of
   undifferentiated "DOC" units; depending on the source and corpus,
   varying amounts of quality checking and filtering were done to
   eliminate noisy or unsuitable content (e.g. test messages).

For this release, all sources have received a uniform treatment in
terms of quality control, and we have applied a rudimentary (and
_approximate_) categorization of DOC units into three distinct "types".
The classification is indicated by the `` type="string" '' attribute
that is included in each opening ``DOC'' tag.  The three types are:

* story : This is by far the most frequent type, and it represents the
  most typical newswire item: a coherent report on a particular topic
  or event, consisting of paragraphs and full sentences.  As indicated
  above, the paragraph tag "<P>" is found only in DOCs of this type;
  in the other types described below, the text content is rendered
  with no additional tags or special characters -- just lines of
  tokens separated by whitespace.

* multi : This type of DOC contains a series of unrelated "blurbs",
  each of which briefly describes a particular topic or event; this is
  typically applied to DOCs that contain "summaries of today's news",
  "news briefs in ... (some general area like finance or sports)", and
  so on.  Each paragraph-like blurb by itself is coherent, but it does
  not bear any necessary relation of topicality or continuity relative
  to its neighbors.

* other : This represents DOCs that clearly do not fall into any of
  the above types -- in general, items of this type are intended for
  broad circulation (they are not advisories), they may be topically
  coherent (unlike "multi" type DOCs), and they typically do not
  contain paragraphs or sentences (they aren't really "stories");
  these are things like lists of sports scores, stock prices,
  temperatures around the world, and so on.

The general strategy for categorizing DOCs into these classes was, for
each source, to discover the most common and frequent clues in the
text stream that correlated with the "non-story" types, and to apply
the appropriate label for the ``type=...'' attribute whenever the DOC
displayed one of these specific clues.  When none of the known clues
was in evidence, the DOC was classified as a "story".

This means that the most frequent classification error will tend to be
the use of `` type="story" '' on DOCs that are actually some other
type.  But the number of such errors should be fairly small, compared
to the number of "non-story" DOCs that are correctly tagged as such.
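The "known clue, otherwise story" logic described above can be
sketched in Python as follows.  The clue patterns shown are
hypothetical English placeholders: the actual clues were specific to
each source, were identified in the Arabic text, and are not listed in
this document.

```python
import re

# HYPOTHETICAL clue patterns, for illustration only.
CLUES = [
    ("multi", re.compile(r"news briefs|summary of today's news", re.IGNORECASE)),
    ("other", re.compile(r"stock prices|sports scores", re.IGNORECASE)),
]

def classify(doc_text):
    """Return the type of the first matching clue, or "story" if none match."""
    for doc_type, pattern in CLUES:
        if pattern.search(doc_text):
            return doc_type
    return "story"

print(classify("News briefs in finance"))
print(classify("A report with full sentences."))
```

Because "story" is the fallback, any non-story DOC that lacks a known
clue ends up labeled as a story, which is the error pattern noted
above.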

Previous "Gigaword" corpora (in English and Chinese) had a fourth
category, "advis" (for "advisory"), which applied to DOCs that contain
text intended solely for news service editors, not the news-reading
public.  In preparing the Arabic data, the task of determining
patterns for assigning "non-story" type labels was carried out by a
native speaker of Arabic, and (for whatever reason) this person did
not find the "advis" category to be applicable to any of the data.

Note that the markup was applied algorithmically, using logic that was
based on less-than-complete knowledge of the data.  For the most part,
the HEADLINE, DATELINE and TEXT tags have their intended content; but
due to the inherent variability (and the inevitable source errors) in
the data, users may find occasional mishaps where the headline and/or
dateline were not successfully identified (hence show up within TEXT),
or where an initial sentence or paragraph has been mistakenly tagged
as the headline or dateline.


DATA QUANTITIES
---------------

The "docs" directory contains a set of plain-text tables (datastats.*)
that describe the quantities of data by source and month (i.e. by
file), broken down according to the three "type" categories.  The
overall totals for each source are summarized below.  Note that the
"Totl-MB" numbers show the amount of data you get when the files are
uncompressed (i.e. over 4 gigabytes, total); the "Gzip-MB" column
shows totals for compressed file sizes as stored on the DVD-ROM; the
"K-wrds" numbers are simply the number of space-separated tokens in
the text, excluding SGML tags.

Source	#Files	Gzip-MB	Totl-MB	K-wrds	 #DOCs
AFA	 104	 274	 1091	 94484	 516855
ALH	  95	 431	 1535	139501	 305250
ANN	  96	 415	 1530	140247	 327768
XIA	  24	  47	  192	 17387	 106846
TOTAL	 319	1167	 4348	391619	1256719
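The "K-wrds" measure described above (space-separated tokens,
excluding SGML tags) can be approximated with the Python sketch below.
The function name and sample lines are hypothetical, and the sketch
assumes that headline and dateline text count toward the total, which
this document does not state either way.

```python
def count_tokens(lines):
    """Count whitespace-separated tokens on all lines that are not tags."""
    return sum(len(line.split()) for line in lines if "<" not in line)

sample = [
    '<DOC id="ANN20010315.0007" type="other" >',
    "<TEXT>",
    "alpha beta  gamma",
    "delta",
    "</TEXT>",
    "</DOC>",
]
tokens = count_tokens(sample)
print(tokens, "tokens =", tokens / 1000, "K-wrds")
```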

The following tables present "K-wrds" and "#DOCs" broken down by
source and DOC type:

	#DOCs  K-wrds
type="multi":
AFA	  3367	  440
ALH	  2148	 1277
ANN	  5786	 2070
XIA	  3484	  951
TOTAL	 14785	 4738

type="other":
AFA	 18335	 1598
ALH	  2642	 1233
ANN	  5482	 3405
XIA	  1422	  115
TOTAL	 27881	 6351

type="story":
AFA	 495153	  92439
ALH	 300460	 136987
ANN	 316500	 134786
XIA	 101940	  16327
TOTAL	1214053	 380539
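As a consistency check, the per-type figures above can be summed and
compared with the overall totals for each source.  The Python sketch
below does this with the numbers copied from the tables; the names are
hypothetical, and the Agence France Presse rows are keyed by the file
prefix "AFA".

```python
# (#DOCs, K-wrds) per type and source, copied from the tables above.
BY_TYPE = {
    "multi": {"AFA": (3367, 440), "ALH": (2148, 1277),
              "ANN": (5786, 2070), "XIA": (3484, 951)},
    "other": {"AFA": (18335, 1598), "ALH": (2642, 1233),
              "ANN": (5482, 3405), "XIA": (1422, 115)},
    "story": {"AFA": (495153, 92439), "ALH": (300460, 136987),
              "ANN": (316500, 134786), "XIA": (101940, 16327)},
}
# (#DOCs, K-wrds) per source, from the overall summary table.
OVERALL = {"AFA": (516855, 94484), "ALH": (305250, 139501),
           "ANN": (327768, 140247), "XIA": (106846, 17387)}

def differences():
    """Return {source: (doc_diff, kwrd_diff)}: per-type sum minus overall."""
    result = {}
    for source, (docs, kwrds) in OVERALL.items():
        doc_sum = sum(rows[source][0] for rows in BY_TYPE.values())
        kw_sum = sum(rows[source][1] for rows in BY_TYPE.values())
        result[source] = (doc_sum - docs, kw_sum - kwrds)
    return result

for source, (doc_diff, kw_diff) in differences().items():
    print(source, doc_diff, kw_diff)
```

The #DOCs figures agree exactly for every source.  The K-wrds sums
differ from the overall figures by small amounts (at most 14 K-wrds);
this document does not explain the gap, and rounding of per-file
counts is one possible cause.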


GENERAL PROPERTIES OF THE DATA
------------------------------

The AFP Arabic archive was received at LDC via a continuous data feed
over a dedicated satellite dish and modem, spooling into daily files
on a main server computer.  At various times throughout the multi-year
collection period, there were intermittent problems with the equipment
or the signal reception, yielding "noise" and abrupt interruptions in
the data stream.  We have taken a range of steps to eliminate
fragmentary and noisy data from the collection in preparing this
release.  Through UTF-8 conversion and SGML validation, we can at
least be sure that the data contain only the appropriate characters
and that all the markup is well formed.  It is still possible that a
handful of stories contain undetected "transients", e.g. cases where
the server shut down for an indeterminate period and then restarted,
leaving no detectable evidence in the data that was spooling onto
disk, resulting in one "news story" that actually contains parts of
two unrelated stories (but server interruptions were relatively
infrequent, and would usually leave evidence).  Also, some patterns of
character corruption may have gone undetected, if they happened to
consist entirely of "valid" character data (despite being nonsensical
to a human reader); based on the results of our quality-control passes
over these files, there may be a higher likelihood of undetected text
corruption in the period between June 1, 2001 and September 30, 2002.

The An Nahar and Al Hayat data sets were produced from bulk archives
that were delivered to the LDC on CD-ROM, and the Xinhua Arabic
archive was delivered in bulk via internet transfer.  As a result,
these sources avoided many of the problems that afflict transmission
through a serial modem.  Still, these archives contained noticeable
amounts of "noise" (unusable characters, null bytes, etc) which had to
be filtered out for research use.  To some extent, this is an
open-ended problem, and there may be kinds of error conditions that
have gone unnoticed or untreated -- this is true of any large text
collection -- but we have striven to assure that the characters
presented in all files are in fact valid and displayable, and that the
markup is fully compliant relative to the DTD provided here.


David Graff
Linguistic Data Consortium
July, 2003