README File for the ARABIC GIGAWORD CORPUS SECOND EDITION
=========================================================
INTRODUCTION
------------
Arabic Gigaword Second Edition was produced by the Linguistic Data
Consortium (LDC); its catalog number is LDC2006T02 and its ISBN is
1-58563-371-2. It is a comprehensive archive of newswire text data
acquired from Arabic news sources by the LDC at the University of
Pennsylvania.
Arabic Gigaword Second Edition includes all of the content of the
first edition of Arabic Gigaword (LDC2003T12) as well as new data.
Five distinct sources of Arabic newswire are represented here:
- Agence France Presse (afp_arb; formerly afa)
- Al Hayat News Agency (hyt_arb; formerly alh)
- An Nahar News Agency (nhr_arb; formerly ann)
- Ummah Press (umh_arb)
- Xinhua News Agency (xin_arb; formerly xia)
The seven-character codes shown above serve both as the directory
names where the data files are found and as the prefix that appears
at the beginning of every file name. Each code consists of a
three-letter source ID and the three-letter language code ("arb"),
separated by an underscore ("_") character. The language code "arb"
is the ISO 639-3 identifier for Standard Arabic. In the first edition
of the Arabic Gigaword corpus, a simpler three-character code scheme
was used to identify both the source and the language. The new
convention allows us to distinguish data sets by source and language
more naturally when a single newswire provider distributes data in
multiple languages.
The five news services all use Modern Standard Arabic (MSA), so there
should be a fairly limited scope for orthographic and lexical
variation due to regional Arabic dialects. However, to the extent
that regional dialects might have an influence on MSA usage, it should
be noted that An Nahar is based in Beirut, Lebanon, and it may be safe
to assume that its material is created predominantly by speakers of
Levantine Arabic. Al Hayat was originally a Lebanese news service as
well, but it has been based in London during the entire period
represented in this archive (and its owners are in Saudi Arabia, so it
is sometimes referred to as a Saudi news service); even so, much of
its reporting/editorial staff may be of Levantine origins. The Xinhua
and AFP services are obviously international in scope (Xinhua is based
in Beijing, AFP in Paris), and we have no information about the
regional distribution of Arabic reporters and editors for these
services. Ummah Press is a new source added to the Second Edition.
The following table shows the new data that appear for the first time
in the Second Edition:

  Source                  Period           Documents
  Agence France Presse    2003.01-2004.12     143766
  Al Hayat News Agency    2002.01-2003.12      64308
  An Nahar News Agency    2003.01-2004.01      16316
  Ummah Press             2003.01-2004.12       4641
  Xinhua News Agency      2003.06-2004.12     106236
CHARACTER ENCODING
------------------
The original data archives received by the LDC used three different
character encodings for Arabic: An Nahar provided their archives in
MacArabic, Xinhua, Al Hayat and Ummah used CP1256, and AFP used a
7-bit encoding called ASMO 449. (In the earlier release of AFP Arabic
data, this was converted to ISO 8859-6, and that encoding served as
the source form for preparing the Gigaword release.) To avoid the
problems and confusion that could result from differences in
character-set specifications, all text files in this corpus have been
converted to UTF-8 character encoding.
Owing to the use of UTF-8, the SGML tagging within each file
(described in detail in the next section) shows up as lines of
single-byte-per-character (ASCII) text, whereas lines of actual text
data, including article headlines and datelines, contain a mixture of
single-byte and multi-byte characters. In general, single-byte
characters in the text data will consist of digits and punctuation
marks (where the original source relied on ASCII punctuation codes,
rather than Arabic-specific punctuation), whereas multi-byte
characters consist of Arabic letters and a small number of special
punctuation or other symbols. This variable-width character encoding
is intrinsic to UTF-8, and all UTF-8-capable processes will handle the
data appropriately.
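This mixture of byte widths can be seen directly in Python (a
hypothetical snippet for illustration, not part of the corpus
tooling): Arabic letters fall in the U+0600 block and encode to two
bytes each in UTF-8, while ASCII digits and punctuation take one byte.

```python
# Arabic letters encode to two bytes each in UTF-8; ASCII stays one byte.
word = "\u0627\u0644\u0639\u0631\u0628\u064a\u0629"  # "Arabic" in Arabic script
line = word + " 2004."                               # mixed Arabic text, digits, punctuation

print(len(word), len(word.encode("utf-8")))  # 7 characters, 14 bytes
print(len(line), len(line.encode("utf-8")))  # 13 characters, 20 bytes
```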
The MacArabic encoding was designed to support ASCII digit characters
as well as the so-called Arabic-Indic digits, which have distinct
glyphs but are semantically equivalent to ASCII digits; CP1256 and
ASMO/ISO provide ASCII digits only. On inspecting the An Nahar text
data, we found that both ASCII and Arabic-Indic digits were used, but
there seemed to be no rule or pattern to predict which set would be
used in a given instance. In addition, because of the character
rendering assumptions that underlie the MacArabic encoding, strings of
Arabic-Indic digits are presented in text files using "right-to-left
display order" while ASCII digit strings use logical order.
Readers of Arabic always read digit strings in a manner equivalent to
readers of English and other left-to-right languages -- i.e. the most
significant digit is always displayed left-most in the string --
regardless of the glyphs being used for the digits. In terms of
ordering digit characters in a data stream, "logical order" refers to
having the most significant digit presented first in the stream. In
English and other left-to-right languages, "logical order" is
identical to "display order", but for Arabic, "logical order" is the
reverse of "right-to-left display order".
To minimize confusion and useless variability in the Gigaword text
files, we have converted all Arabic-Indic digits in An Nahar data to
their ASCII equivalents, and when these occurred in strings of 2 or
more digits, we have reversed the strings so that they are presented
in logical order in each file, to be consistent with the conventions
used in the other sources.
The original AFP source data always used right-to-left display order
for digit strings -- this is because the service assumes the data are
being supplied mainly to printing devices that operate in a strict,
linear right-to-left fashion. All digit strings in the AFP files have
been reversed in the Gigaword release to yield logical ordering.
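The combined effect of these two normalizations can be sketched as
follows. This is a hypothetical re-implementation for illustration
only (the LDC's actual conversion tools are not included in the
corpus), and it shows the AFP-style case, where every multi-digit run
was stored in display order; for An Nahar, only the Arabic-Indic runs
needed reversal.

```python
import re

# Arabic-Indic digits U+0660..U+0669 map one-to-one onto ASCII '0'..'9'.
ARABIC_INDIC = {chr(0x0660 + i): str(i) for i in range(10)}

def normalize_digits(text: str) -> str:
    """Map Arabic-Indic digits to their ASCII equivalents, then reverse
    runs of two or more digits, turning right-to-left display order
    into logical (most-significant-first) order."""
    ascii_text = "".join(ARABIC_INDIC.get(ch, ch) for ch in text)
    return re.sub(r"[0-9]{2,}", lambda m: m.group(0)[::-1], ascii_text)

# A year stored in display order comes out in logical order.
print(normalize_digits("\u0664\u0660\u0660\u0662"))  # prints "2004"
```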
DATA FORMAT AND SGML MARKUP
---------------------------
Each data file name consists of the 7-letter prefix, an underscore
character, and a 6-digit date (representing the year and month during
which the file contents were generated by the respective news source),
followed by a ".gz" file extension, indicating that the file contents
have been compressed using the GNU "gzip" compression utility (RFC
1952). So, each file contains all the usable data received by LDC for
the given month from the given news source.
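The naming convention can be made concrete with a small parser (a
hypothetical helper, not part of the distribution; the sample file
name is only illustrative):

```python
import re

# File names follow <3-letter source>_<3-letter language>_<YYYYMM>.gz
FILENAME_RE = re.compile(
    r"^(?P<source>[a-z]{3})_(?P<lang>[a-z]{3})_(?P<year>\d{4})(?P<month>\d{2})\.gz$"
)

def parse_filename(name: str) -> dict:
    """Split a Gigaword data file name into source, language, year, month."""
    m = FILENAME_RE.match(name)
    if m is None:
        raise ValueError(f"not a Gigaword data file name: {name!r}")
    return m.groupdict()

print(parse_filename("afp_arb_200301.gz"))
# {'source': 'afp', 'lang': 'arb', 'year': '2003', 'month': '01'}
```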
All text data are presented in SGML form, using a very simple, minimal
markup structure. The file "gigaword_a.dtd" in the "dtd" directory
provides the formal "Document Type Definition" (DTD) for parsing the SGML
content. The corpus has been fully validated by a standard SGML
parser utility (nsgmls), using this DTD file.
The markup structure, common to all data files, can be summarized as
follows:

  <DOC id="AFP_ARB_YYYYMMDD.NNNN" type="story" >
  <HEADLINE>
  This is the headline
  </HEADLINE>
  <DATELINE>
  This is the dateline
  </DATELINE>
  <TEXT>
  <P>
  Paragraph tags are only used if the 'type' attribute of the DOC
  happens to be "story" -- more on the 'type' attribute below...
  </P>
  <P>
  Note that all data files use the UNIX-standard "\n" form of line
  termination, and text lines are generally wrapped to a width of 80
  characters or less.
  </P>
  </TEXT>
  </DOC>

For every "<DOC>" element, the "id=" attribute provides a unique
document identifier, and the "type=" attribute assigns the DOC to one
of three classes:

* story : This is by far the most common type of DOC; it represents a
coherent report on a particular topic or event, consisting of
paragraphs and full sentences. The paragraph tag "<P>" is found only
in DOCs of this type; in the other types described below, the text
content is rendered with no additional tags or special characters --
just lines of tokens separated by whitespace.

* multi : This type of DOC contains a series of unrelated "blurbs",
each of which briefly describes a particular topic or event; this is
typically applied to DOCs that contain "summaries of today's news",
"news briefs in ... (some general area like finance or sports)", and
so on. Each paragraph-like blurb by itself is coherent, but it does
not bear any necessary relation of topicality or continuity relative
to its neighbors.

* other : This represents DOCs that clearly do not fall into any of
the above types -- in general, items of this type are intended for
broad circulation (they are not advisories), they may be topically
coherent (unlike "multi" type DOCs), and they typically do not
contain paragraphs or sentences (they aren't really "stories"); these
are things like lists of sports scores, stock prices, temperatures
around the world, and so on.

The general strategy for categorizing DOCs into these classes was,
for each source, to discover the most common and frequent clues in
the text stream that correlated with the "non-story" types, and to
apply the appropriate label for the "type=" attribute whenever the
DOC displayed one of these specific clues. When none of the known
clues was in evidence, the DOC was classified as a "story". This
means that the most frequent classification error will tend to be the
use of "type=story" on DOCs that are actually some other type. But
the number of such errors should be fairly small, compared to the
number of "non-story" DOCs that are correctly tagged as such.

Other "Gigaword" corpora (in English and Chinese) had a fourth
category, "advis" (for "advisory"), which applied to DOCs that
contain text intended solely for news service editors, not the
news-reading public.
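Given this markup, individual DOC elements can be pulled out of a
compressed data file along the following lines. This is a hypothetical
sketch: the markup is simple and regular enough for a regex pass, but
strict applications should use an SGML parser (e.g. nsgmls with the
supplied DTD) instead.

```python
import gzip
import re

# Match one DOC element, capturing its id, type, and raw body text.
DOC_RE = re.compile(
    r'<DOC id="(?P<id>[^"]+)" type="(?P<type>[^"]+)"\s*>(?P<body>.*?)</DOC>',
    re.DOTALL,
)

def iter_docs(path):
    """Yield (doc_id, doc_type, raw_body) for each DOC in a .gz data file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        text = f.read()
    for m in DOC_RE.finditer(text):
        yield m.group("id"), m.group("type"), m.group("body")
```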
In preparing the Arabic data, the task of determining patterns for
assigning "non-story" type labels was carried out by a native speaker
of Arabic, and (for whatever reason) this person did not find the
"advis" category to be applicable to any of the data.

Note that the markup was applied algorithmically, using logic that
was based on less-than-complete knowledge of the data. For the most
part, the HEADLINE, DATELINE and TEXT tags have their intended
content; but due to the inherent variability (and the inevitable
source errors) in the data, users may find occasional mishaps where
the headline and/or dateline were not successfully identified (hence
show up within TEXT), or where an initial sentence or paragraph has
been mistakenly tagged as the headline or dateline.

DATA QUANTITIES
---------------
The "docs" directory contains a set of plain-text tables (datastats_*)
that describe the quantities of data by source and month (i.e. by
file), broken down according to the three "type" categories. The
overall totals for each source are summarized below. Note that the
"Totl-MB" numbers show the amount of data you get when the files are
uncompressed (i.e. nearly 5 gigabytes, total); the "Gzip-MB" column
shows totals for compressed file sizes as stored on the DVD-ROM; the
"K-wrds" numbers are simply the number of space-separated tokens in
the text, excluding SGML tags.
  Source   #Files  Gzip-MB  Totl-MB  K-wrds    #DOCs
  AFP_ARB     128      355     1429  123594   660621
  HYT_ARB     119      526     1861  169100   369555
  NHR_ARB     109      448     1649  151078   344084
  UMH_ARB      24        4       13    1201     4645
  XIN_ARB      43      100      407   36933   213082
  TOTAL       423     1433     5359  481906  1591987

The following tables present "#DOCs" and "K-wrds" broken down by
source and DOC type:

  type="multi":   #DOCs  K-wrds
  AFP_ARB         10696    2554
  HYT_ARB          2875    1967
  NHR_ARB          5807    2082
  UMH_ARB             0       0
  XIN_ARB          9635    2494
  TOTAL           29013    9097

  type="other":   #DOCs  K-wrds
  AFP_ARB         19677    1843
  HYT_ARB          2814    1312
  NHR_ARB          5923    3677
  UMH_ARB             0       0
  XIN_ARB          2221     179
  TOTAL           30635    7011

  type="story":   #DOCs  K-wrds
  AFP_ARB        630248  119190
  HYT_ARB        363866  165820
  NHR_ARB        332354  145332
  UMH_ARB          4645    1201
  XIN_ARB        201226   34267
  TOTAL         1532339  465810

GENERAL PROPERTIES OF THE DATA
------------------------------
The AFP Arabic archive was received at the LDC via a continuous data
feed over a dedicated satellite dish and modem, spooling into daily
files on a main server computer. At various times throughout the
multi-year collection period, there were intermittent problems with
the equipment or the signal reception, yielding "noise" and abrupt
interruptions in the data stream. We have taken a range of steps to
eliminate fragmentary and noisy data from the collection in preparing
this release. Through UTF-8 conversion and SGML validation, we can at
least be sure that the data contain only the appropriate characters
and that all the markup is well formed. It is still possible that a
handful of stories contain undetected "transients", e.g. cases where
the server shut down for an indeterminate period and then restarted,
leaving no detectable evidence in the data that was spooling onto
disk, resulting in one "news story" that actually contains parts of
two unrelated stories (but server interruptions were relatively
infrequent, and would usually leave evidence).
Also, some patterns of character corruption may have gone undetected,
if they happened to consist entirely of "valid" character data
(despite being nonsensical to a human reader); based on the results
of our quality-control passes over these files, there may be a higher
likelihood of undetected text corruption in the period between June
1, 2001 and September 30, 2002.

The An Nahar and Al Hayat data sets were produced from bulk archives
that were delivered to the LDC on CD-ROM, and the Xinhua Arabic
archive was delivered in bulk via internet transfer. The Ummah texts
were delivered via email transmission. As a result, these sources
avoided many of the problems that afflict transmission through a
serial modem. Still, these archives contained noticeable amounts of
"noise" (unusable characters, null bytes, etc.) which had to be
filtered out for research use. To some extent, this is an open-ended
problem, and there may be kinds of error conditions that have gone
unnoticed or untreated -- this is true of any large text collection
-- but we have striven to assure that the characters presented in all
files are in fact valid and displayable, and that the markup is fully
compliant relative to the DTD provided here.

MAPPING FILE NAMES AND DOC IDS FROM FIRST EDITION
-------------------------------------------------
All of the documents in the first edition of the Arabic Gigaword
corpus can be mapped to the same documents in this edition by
changing the prefix of DOC IDs and file names as shown below. The
upper-case forms are used for DOC IDs; the lower-case forms are used
for file and directory names. The underscore character that connects
the 7-letter prefix and the date is included in the following table.

  Old   New
  ---------------
  AFA   AFP_ARB_
  ALH   HYT_ARB_
  ANN   NHR_ARB_
  XIA   XIN_ARB_

DUPLICATE DOCUMENT INFORMATION
------------------------------
Some newswire sources may distribute stories that are fully or
partially identical.
We have not attempted to eliminate these duplications; however, we
plan to make information about duplicate and similar articles
available on our web site as supplemental information for this
corpus.

ADDITIONAL INFORMATION AND UPDATES
----------------------------------
Additional information, updates, and bug fixes may be available in
the LDC catalog entry for this corpus (LDC2006T02) at:

  http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2006T02

David Graff
Linguistic Data Consortium
July, 2003

(Updated for the Second Edition by Junbo Kong and Kazuaki Maeda,
Dec. 2005)