README File for the ARABIC GIGAWORD CORPUS FIFTH EDITION
=========================================================


INTRODUCTION
------------

Arabic Gigaword Fifth Edition was produced by the Linguistic Data
Consortium (LDC); the catalog number is LDC2011T11 and the ISBN is
1-58563-595-2.

This is a comprehensive archive of newswire text data that has been
acquired from Arabic news sources by the LDC at the University of
Pennsylvania. Arabic Gigaword Fifth Edition includes all of the content
of the fourth edition of Arabic Gigaword (LDC2009T30) as well as new
data.

Nine distinct sources of Arabic newswire are represented here:

  - Asharq Al-Awsat         (aaw_arb)
  - Agence France Presse    (afp_arb)
  - Al-Ahram                (ahr_arb)
  - Assabah News Agency     (asb_arb)
  - Al Hayat News Agency    (hyt_arb)
  - An Nahar News Agency    (nhr_arb)
  - Al-Quds Al-Arabi        (qds_arb)
  - Ummah Press             (umh_arb)
  - Xinhua News Agency      (xin_arb)

The seven-character codes shown above represent both the directory
names where the data files are found and the 7-letter prefix that
appears at the beginning of every file name. Each code consists of a
three-character source ID and the three-character language code ("arb")
separated by an underscore ("_") character.

The nine news services all use Modern Standard Arabic (MSA), so there
should be a fairly limited scope for orthographic and lexical variation
due to regional Arabic dialects. However, to the extent that regional
dialects might have an influence on MSA usage, the following should be
noted:

 - Asharq Al-Awsat is based in London, England, UK.

 - Al-Ahram is based in Cairo, Egypt.

 - An Nahar is based in Beirut, Lebanon, and it may be safe to assume
   that its material is created predominantly by speakers of Levantine
   Arabic.

 - Al Hayat was originally a Lebanese news service as well, but it has
   been based in London during the entire period represented in this
   archive (and its owners are in Saudi Arabia, so it is sometimes
   referred to as a Saudi news service); even so, much of its
   reporting/editorial staff may be of Levantine origins.

 - Assabah is based in Tunisia.

 - The Xinhua and AFP services are international in scope (Xinhua is
   based in Beijing, AFP in Paris), and we have no information about
   the regional distribution of Arabic reporters and editors for these
   services.

 - The content provided by Ummah Press comes from diverse sources
   throughout the Arabic-speaking world.

 - Al-Quds Al-Arabi is based in London, England, UK and was founded by
   Palestinian expatriates.


DIFFERENCES IN RELEASE 5 RELATIVE TO THE PREVIOUS RELEASE
---------------------------------------------------------

-- New Data

This release contains all data for the sources above collected by LDC
between January 2009 and December 2010.

-- Updates to data from the previous release

 * Repeated documents in Asharq Al-Awsat data from 2008 were removed.

 * Document formatting and docid duplication problems were corrected
   in AFP data.

   Several documents in the 2007 and 2008 AFP data were found to
   contain formatting errors, specifically unescaped ampersands and
   some ASCII control characters. These problems were corrected. In
   addition, 3 documents had been placed in the wrong files and were
   moved to the appropriate locations. Moving these documents, in
   addition to an existing issue with duplicate docid assignments,
   necessitated reassigning docids to existing documents and moving
   them to different locations within the file.
   If more than one document in the file was found to have the same
   docid, all documents after the first instance of the docid were
   assigned new docids by finding the first available docid for the
   date, starting at 5000, e.g.:

     AFP_ARB_20071019.0001 -> AFP_ARB_20071019.5000

   For information on which documents were moved, as well as which
   documents received new docids, please see:

     data/afp_moved_docs_and_reassigned_docids.tab

   This is a tab-delimited file. The fields are:

     orig. file  - the file in which the document occurred in the 4th
                   edition
     orig. docid - the docid the document was given in the 4th edition
     new file    - the file in which the document occurs in the
                   current edition
     new docid   - the docid the document is given in the current
                   edition

 * Significant duplication of content in the 2007-8 An Nahar data was
   detected, and the duplicated documents were removed.

   A significant amount of the 2007-8 An Nahar data added in the
   previous release (41% of documents) consisted of duplicated
   documents; these have been removed from the current release. For
   information on which documents were removed, please see:

     data/nhr_removed_duplicates.tab

   This is a tab-delimited file, with the following structure:

     removed docid  - the docid that was removed from the corpus
     matching docid - the docid of the identical document that remains
                      in the corpus

   NB: removal of duplicated documents, using the same process as the
   removal above, also produced a number of gaps in the docid
   sequences for the 2009-10 data.


CHARACTER ENCODING
------------------

The original data archives received by the LDC used a variety of
different character encodings for Arabic:

 - Asharq al-Awsat is delivered in CP-1256.

 - Al-Ahram is delivered in CP-1256.

 - An Nahar archives up to and including 2003 were provided in
   MacArabic; the 2005 and 2006 archives were delivered as Microsoft
   Access database files, with Unicode-encoded Arabic. From 2007
   onward, An Nahar was collected in HTML format in UTF-8.

 - Assabah is delivered in CP-1256 encoding.

 - Al-Quds Al-Arabi is delivered in CP-1256 encoding.

 - Xinhua was delivered in CP-1256 until early 2008, when the encoding
   was changed to UTF-8.

 - Ummah is delivered in CP-1256 and UTF-16.

 - Earlier AFP data used a 7-bit encoding called ASMO 449, which
   consists of a subset of the Arabic letters supported in CP-1256. In
   mid-2007, the delivery encoding was changed to UTF-8.

 - Al Hayat archives up to and including 2001 were provided in
   CP-1256, with subsequent material provided in Unicode.

To avoid the problems and confusion that could result from differences
in character-set specifications, all text files in this corpus have
been converted to Unicode UTF-8 character encoding.

Owing to the use of UTF-8, the SGML tagging within each file
(described in detail in the next section) shows up as lines of
single-byte-per-character (ASCII) text, whereas lines of actual text
data, including article headlines and datelines, contain a mixture of
single-byte and multi-byte characters. In general, single-byte
characters in the text data consist of white-space, digits and
punctuation marks, whereas multi-byte characters consist of Arabic
letters and a small number of special punctuation or other symbols.
This variable-width character encoding is intrinsic to UTF-8, and all
UTF-8 capable processes will handle the data appropriately.
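For illustration only (this is not the conversion pipeline LDC used),
the following Python sketch shows the kind of legacy-to-UTF-8
conversion just described; the byte string is a hypothetical CP-1256
input, and the note about the reverse direction previews the
discussion below:

    # Minimal sketch, assuming a CP-1256 byte stream as input (hypothetical data).
    legacy_bytes = b"\xc7\xe1\xda\xd1\xc8\xed\xc9"   # "العربية" encoded in CP-1256
    text = legacy_bytes.decode("cp1256")             # decode the legacy Arabic encoding
    utf8_bytes = text.encode("utf-8")                # re-encode as UTF-8, as distributed here

    # Converting back to CP-1256 can lose information: characters with no
    # CP-1256 mapping (e.g. Arabic-Indic digits U+0661..U+0663) become "?".
    print("\u0661\u0662\u0663".encode("cp1256", errors="replace"))   # b'???'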
Regarding the source data that was received by LDC as UTF-8 encoded
text, please note that converting the data to any non-Unicode encoding
may be impossible, or may cause a loss of information, because the
original data may contain characters that are not mappable to a legacy
Arabic-based encoding such as CP-1256. Such characters will typically
be replaced by "?", or may cause the conversion process to fail
completely. In particular, the recent data from An Nahar contains a
wide assortment of characters that do not exist in any non-Unicode
Arabic character set. Smaller assortments (and smaller quantities) of
such characters are also found in the other sources. The set of
potentially unmappable characters includes certain accented Latin
characters and various special symbols (currency signs, list-item
numbers or bullet points, special quotation marks, ellipsis, etc.).

The MacArabic encoding was designed to support ASCII digit characters
as well as the so-called Arabic-Indic digits, which have distinct
glyphs but are semantically equivalent to ASCII digits; Unicode also
provides these special digit characters (in fact, two versions of
them) in its Arabic code page. CP-1256 and ASMO/ISO provide ASCII
digits only. In the An Nahar data, and in the more recent data from Al
Hayat, we found that both ASCII and Arabic-Indic digits were used, but
there seemed to be no rule or pattern to predict which set would be
used in a given instance.

In the case of the older An Nahar MacArabic data, because of the
character rendering assumptions that underlie the MacArabic encoding,
strings of Arabic-Indic digits were presented in the text files in
"right-to-left display order", while ASCII digit strings used logical
order. Readers of Arabic always read digit strings in a manner
equivalent to readers of English and other left-to-right languages --
i.e. the most significant digit is always displayed left-most in the
string -- regardless of the glyphs being used for the digits. In terms
of ordering digit characters in a data stream, "logical order" refers
to having the most significant digit presented first in the stream. In
English and other left-to-right languages, "logical order" is
identical to "display order", but for Arabic, "logical order" is the
reverse of "right-to-left display order".

To minimize confusion and useless variability in the Gigaword text
files, we have converted all Arabic-Indic digits in the An Nahar data
to their ASCII equivalents, and when these occurred in strings of 2 or
more digits, we have reversed the strings so that they are presented
in logical order in each file, to be consistent with the conventions
used in the other sources.

In the case of the more recent Al Hayat data, we found not only the
use of Arabic-Indic digits (which in this case used logical ordering),
but also a few instances where the Unicode "presentation form" Arabic
characters (in the code-point ranges U+FB50 through U+FDFF and U+FE70
through U+FEFF) were being used in place of the "normal" characters
(in the code-point range U+0600 through U+06FF). For this source, we
again converted all digits to the ASCII range, and also used standard
Unicode normalization procedures to convert the presentation-form
letters to their "normal" forms.

Prior to 2008-07-13, the original AFP source data used right-to-left
display order for digit strings -- this is because the service assumed
the data were being supplied mainly to printing devices that operate
in a strict, linear right-to-left fashion. All digit strings in the
AFP files from before 2008-07-13 have been reversed in the Gigaword
release to yield logical ordering.
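As a rough illustration of the digit normalization just described
(a simplified sketch, not the script LDC actually ran), Arabic-Indic
digits can be mapped to ASCII and display-order digit runs flipped
into logical order like this:

    import re

    # Map Arabic-Indic (U+0660-0669) and Eastern Arabic-Indic (U+06F0-06F9)
    # digits to their ASCII equivalents.
    DIGIT_MAP = {0x0660 + i: ord("0") + i for i in range(10)}
    DIGIT_MAP.update({0x06F0 + i: ord("0") + i for i in range(10)})

    def normalize_digits(line, reverse_display_order=False):
        """Convert Arabic-Indic digits to ASCII; optionally flip digit runs
        of length >= 2 from right-to-left display order to logical order."""
        line = line.translate(DIGIT_MAP)
        if reverse_display_order:
            line = re.sub(r"\d{2,}", lambda m: m.group(0)[::-1], line)
        return line

    # A display-order run of Arabic-Indic digits denoting 123 -> "123"
    print(normalize_digits("\u0663\u0662\u0661", reverse_display_order=True))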
DATA FORMAT AND SGML MARKUP
---------------------------

Each data file name consists of the 7-letter prefix, an underscore
character, and a 6-digit date (representing the year and month during
which the file contents were generated by the respective news source),
followed by a ".gz" file extension, indicating that the file contents
have been compressed using the GNU "gzip" compression utility (RFC
1952). So, each file contains all the usable data received by LDC for
the given month from the given news source.

All text data are presented in SGML form, using a very simple, minimal
markup structure. The markup structure, common to all data files, can
be summarized as follows:

  <DOC id="XXX_ARB_YYYYMMDD.NNNN" type="story" >
  <HEADLINE>
  (The Headline Element is Optional -- not all DOCs have one)
  </HEADLINE>
  <DATELINE>
  (The Dateline Element is Optional -- not all DOCs have one)
  </DATELINE>
  <TEXT>
  <P>
  Paragraph tags are only used if the 'type' attribute of the DOC
  happens to be "story" -- more on the 'type' attribute below...
  </P>
  </TEXT>
  </DOC>

Note that all data files use the UNIX-standard "\n" form of line termination, and text lines are generally wrapped to a width of 80 characters or less.
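Given this structure, a simple stream-oriented reader is easy to
write. The following Python sketch is an informal illustration only
(it is not one of the Perl tools available from LDC, and the file path
is hypothetical); it iterates over the DOCs in one monthly file and
yields the id, the 'type' attribute, and the non-tag text lines
(including any headline and dateline text):

    import gzip
    import re

    DOC_OPEN = re.compile(r'<DOC id="([^"]+)" type="([^"]+)" ?>')

    def iter_docs(path):
        """Yield (docid, doctype, lines) for each DOC in a Gigaword data file."""
        docid = doctype = None
        lines = []
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                m = DOC_OPEN.match(line)
                if m:
                    docid, doctype, lines = m.group(1), m.group(2), []
                elif line.startswith("</DOC>"):
                    yield docid, doctype, lines
                elif not line.startswith("<"):      # keep text lines, skip other tags
                    lines.append(line.rstrip("\n"))

    # Hypothetical usage: count "story" DOCs in one monthly file.
    if __name__ == "__main__":
        n = sum(1 for _, t, _ in iter_docs("data/afp_arb/afp_arb_200901.gz")
                if t == "story")
        print(n)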

For every "opening" tag (DOC, HEADLINE, DATELINE, TEXT, P), there is a corresponding "closing" tag -- always. The attribute values in the DOC tag are always presented within double-quotes; the "id=" attribute of DOC consists of the 7-letter source abbreviation (in CAPS), an underscore character, an 8-digit date string representing the date of the story (YYYYMMDD), a period, and a 4-digit sequence number starting at "0001" for each date (e.g. "XIN_ARB_200101.0001"); in this way, every DOC in the corpus is uniquely identifiable by the id string. Every SGML tag is presented alone on one line, separate from other tags, and from the text content (so a simple process like the UNIX "grep -v '<'" will eliminate all tags, and retain all the text content). In general, these files are not intended for use with interactive text editors or word processing software (though many such programs are likely to work reasonably well with these files). Rather, it's expected that the files will be used as input to programs that are geared to dealing with data in such quantities, for filtering, conditioning, indexing, statistical summary, etc. (The LDC can provide open source software, mostly written in Perl, for extracting DOCs from such data files, using the "id" string or other search criteria for story selection; see http://www.ldc.upenn.edu/Using/ .) - Earlier corpora tended to use different markup outlines (different tag sets) depending on the source of the data, because different sources came to us with different structural properties, and we had chosen to preserve these as much as possible (even though many elements of the delivered structure may have been meaningless for research use). The present corpus uses only the information structure that is common to all sources and serves a clear function: headline, dateline, and core news content (usually containing paragraphs). The "dateline" is a brief string typically found at the beginning of the first paragraph in each news story, giving the location the report is coming from, and sometimes the news service and/or date; since this content is not part of the initial sentence, we separate it from the first paragraph (this was not done in previous corpora). - Earlier corpora tended to include "custom" SGML entity references, which were intended to preserve things like special punctuation or typesetting instructions (e.g. "&QL;", "&UR;", "&MD;", etc). The present corpus uses only three SGML entity reference: - ``&'' represents the literal ampersand "&" character - ``<'' represents the literal open-angle bracket "<" - ``>'' represents the literal close-angle bracket ">" All other specialized control characters have been filtered out. - In earlier corpora, newswire data were presented as streams of undifferentiated "DOC" units; depending on the source and corpus, varying amounts of quality checking and filtering were done to eliminate noisy or unsuitable content (e.g. test messages). For this release, all sources have received a uniform treatment in terms of quality control, and we have applied a rudimentary (and _approximate_) categorization of DOC units into four distinct "types". The classification is indicated by the `` type="string" '' attribute that is included in each opening ``DOC'' tag. The four types are: * story : This is by far the most frequent type, and it represents the most typical newswire item: a coherent report on a particular topic or event, consisting of paragraphs and full sentences. As indicated above, the paragraph tag "

" is found only in DOCs of this type; in the other types described below, the text content is rendered with no additional tags or special characters -- just lines of tokens separated by whitespace. * multi : This type of DOC contains a series of unrelated "blurbs", each of which briefly describes a particular topic or event; this is typically applied to DOCs that contain "summaries of today's news", "news briefs in ... (some general area like finance or sports)", and so on. Each paragraph-like blurb by itself is coherent, but it does not bear any necessary relation of topicality or continuity relative to it neighbors. * other : This represents DOCs that clearly do not fall into any of the above types -- in general, items of this type are intended for broad circulation (they are not advisories), they may be topically coherent (unlike "multi" type DOCs), and they typically do not contain paragraphs or sentences (they aren't really "stories"); these are things like lists of sports scores, stock prices, temperatures around the world, and so on. The general strategy for categorizing DOCs into these classes was, for each source, to discover the most common and frequent clues in the text stream that correlated with the "non-story" types, and to apply the appropriate label for the ``type=...'' attribute whenever the DOC displayed one of these specific clues. When none of the known clues was in evidence, the DOC was classified as a "story". This means that the most frequent classification error will tend to be the use of `` type="story" '' on DOCs that are actually some other type. But the number of such errors should be fairly small, compared to the number of "non-story" DOCs that are correctly tagged as such. Other "Gigaword" corpora (in English and Chinese) had a fourth category, "advis" (for "advisory"), which applied to DOCs that contain text intended solely for news service editors, not the news-reading public. In preparing the Arabic data, the task of determining patterns for assigning "non-story" type labels was carried out by a native speaker of Arabic, and (for whatever reason) this person did not find the "advis" category to be applicable to any of the data. Note that the markup was applied algorithmically, using logic that was based on less-than-complete knowledge of the data. For the most part, the HEADLINE, DATELINE and TEXT tags have their intended content; but due to the inherent variability (and the inevitable source errors) in the data, users may find occasional mishaps where the headline and/or dateline were not successfully identified (hence show up within TEXT), or where an initial sentence or paragraph has been mistakenly tagged as the headline or dateline. DATA QUANTITIES --------------- The "docs" directory contains a set of plain-text tables (datastats_*) that describe the quantities of data by source and month (i.e. by file), broken down according to the three "type" categories. The overall totals for each source are summarized below. Note that the "Totl-MB" numbers show the amount of data you get when the files are uncompressed (i.e. nearly 5 gigabytes, total); the "Gzip-MB" column shows totals for compressed file sizes as stored on the DVD-ROM; the "K-wrds" numbers are simply the number of space separated tokens in the text, excluding SGML tags. 
  Source    #Files  Gzip-MB  Totl-MB   K-wrds    #DOCs
  ------------------------------------------------------
  aaw_arb       50      228      777    75488   165990
  afp_arb      200      638     2368   226615  1084386
  ahr_arb       50      228      777    75103   191156
  asb_arb       76       72      242    23666    55963
  hyt_arb      190      744     2498   241107   510049
  nhr_arb      181      793     2696   262421   569045
  qds_arb       50      164      528    51599   117691
  umh_arb       92       12       43     4127    15675
  xin_arb      115      327     1197   117256   636212
  ------------------------------------------------------
  TOTAL       1004     3206    11126  1077382  3346167

The following tables present "#DOCs", "K-wrds" and uncompressed size
in kilobytes ("Totl-KB"), broken down by source and DOC type:

  multi        #DOCs   K-wrds  Totl-KB
  -------------------------------------
  aaw_arb          0        0        0
  afp_arb      20453     6550    74844
  ahr_arb          0        0        0
  asb_arb          0        0        0
  hyt_arb       2875     1967    20655
  nhr_arb       5807     2082    21227
  qds_arb          0        0        0
  umh_arb          0        0        0
  xin_arb       9635     2486    26022
  -------------------------------------
  TOTAL        38770    13085   142748

  other        #DOCs   K-wrds  Totl-KB
  -------------------------------------
  aaw_arb          0        0        0
  afp_arb     129044    14818   164028
  ahr_arb          0        0        0
  asb_arb       6476     2201    22388
  hyt_arb       2814     1314    14122
  nhr_arb       5923     3677    38398
  qds_arb          0        0        0
  umh_arb          0        0        0
  xin_arb       4655      324     3296
  -------------------------------------
  TOTAL       148912    22334   242232

  story        #DOCs   K-wrds  Totl-KB
  -------------------------------------
  aaw_arb     165990    73719   795943
  afp_arb     934889   199934  2184527
  ahr_arb     191156    73348   795291
  asb_arb      49487    20911   225646
  hyt_arb     504360   232180  2522690
  nhr_arb     557315   250524  2701510
  qds_arb     117691    50386   540453
  umh_arb      15675     4028    43927
  xin_arb     621922   111705  1195887
  -------------------------------------
  TOTAL      3158485  1016735 11005874


GENERAL PROPERTIES OF THE DATA
------------------------------

Prior to July 2007, the AFP Arabic data were received at LDC via a
continuous data feed over a dedicated satellite dish and modem,
spooling into daily files on a main server computer. At various times
throughout the multi-year collection period, there were intermittent
problems with the equipment or the signal reception, yielding "noise"
and abrupt interruptions in the data stream. We have taken a range of
steps to eliminate fragmentary and noisy data from the collection in
preparing this release.

Through UTF-8 conversion and SGML validation, we can at least be sure
that the data contain only the appropriate characters and that all the
markup is well formed. It is still possible that a handful of stories
contain undetected "transients", e.g. cases where the server shut down
for an indeterminate period and then restarted, leaving no detectable
evidence in the data that was spooling onto disk, resulting in one
"news story" that actually contains parts of two unrelated stories
(but server interruptions were relatively infrequent, and would
usually leave evidence). Also, some patterns of character corruption
may have gone undetected, if they happened to consist entirely of
"valid" character data (despite being nonsensical to a human reader);
based on the results of our quality-control passes over these files,
there may be a higher likelihood of undetected text corruption in the
period between June 1, 2001 and September 30, 2002.

From mid-July 2007 onwards, the AFP data were delivered in an XML
format via AFP's proprietary DreamServer delivery system. In general,
this delivery method has largely eliminated the issues of line noise
and similar data corruption.

For Assabah, the LDC received an archive of web content covering the
period of September 2004 through November 2006, and as of the latter
date, we have been maintaining a steady download of content on a daily
basis.

An Nahar data before 2005 was delivered in MacArabic encoding. From
January 2005 through November 2006, An Nahar data was provided to LDC
in a Microsoft Access database file on CD-ROM. Article data were
extracted from the database in the form of a single HTML stream for
each year's archive. The Arabic character content was rendered as
numeric Unicode character entities, and these were converted to UTF-8
for publication by LDC.
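To illustrate that last conversion step (a hypothetical sketch, not
the code LDC used), decimal or hexadecimal numeric character
references can be decoded into UTF-8 text as follows:

    import re

    def decode_numeric_entities(s):
        """Replace decimal and hexadecimal numeric character references
        (e.g. "&#1575;" or "&#x627;") with the characters they denote."""
        def repl(m):
            code = m.group(1)
            return chr(int(code[1:], 16) if code[0] in "xX" else int(code))
        return re.sub(r"&#([xX]?[0-9A-Fa-f]+);", repl, s)

    # "&#1575;&#1604;" -> the Arabic letters alef + lam
    text = decode_numeric_entities("&#1575;&#1604;")
    print(text.encode("utf-8"))   # b'\xd8\xa7\xd9\x84'

The same decoding is available in Python's standard library as
html.unescape, which also handles named entities.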
In November 2006, LDC changed the An Nahar collection to a daily
harvest via HTTP from the source's website.

Al Hayat data prior to 2004 were also produced from bulk CD-ROM
archives; the LDC has yet to acquire similar archives for the period
from January 2004 through October 2006. However, we were able to
obtain relatively small portions of the 2005 and 2006 archives via web
download. Starting in November 2006, we harvest the full content of Al
Hayat via web download on a daily basis; this change in collection is
reflected in the monthly files hyt_arb_200611 and hyt_arb_200612,
which are comparable in size to the pre-2004 files.

Most of the Xinhua Arabic archive was delivered in bulk via internet
transfer (FTP), and the LDC has been maintaining a steady download of
all content on a daily basis.

The Ummah texts were delivered via email transmission and include
English translations of each of the stories delivered. (The English
content is not provided here.) Because of the low overall volume of
data received from this source, combined with significant variability
in its delivery methods and formats, it was decided that the overall
benefit of providing new content from this source would not warrant
the effort required to normalize the material. Although LDC has made
some effort to ensure that the contents of the data provided in this
release constitute single articles in Arabic, it is possible that some
Ummah documents may contain content from more than a single logical
document, as well as sections of English text.

Asharq al-Awsat and Al-Quds Al-Arabi were retrieved via daily HTTP
downloads from the sources' websites. The content of the documents in
this corpus was then extracted automatically by scripts.

While all sources other than AFP have been received via internet
transfers of one sort or another, and have therefore avoided many of
the problems that afflict transmission through a serial modem, these
archives still contained noticeable amounts of "noise" (unusable
characters, null bytes, etc.) which had to be filtered out for
research use. To some extent, this is an open-ended problem, and there
may be kinds of error conditions that have gone unnoticed or untreated
-- this is true of any large text collection -- but we have striven to
ensure that the characters presented in all files are in fact valid
and displayable, and that the markup is fully compliant relative to
the DTD provided here.


DUPLICATE DOCUMENT INFORMATION
------------------------------

Some newswire sources may distribute stories that are fully or
partially identical. We have not attempted to eliminate these
duplications; however, we plan to make information about duplicate and
similar articles available on our web site as supplemental information
for this corpus.


ADDITIONAL INFORMATION AND UPDATES
----------------------------------

Additional information, updates, and bug fixes may be available in the
LDC catalog entry for this corpus (LDC2011T11) at:

  http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2011T11


Robert Parker
Linguistic Data Consortium
August 2011