README File for the ARABIC GIGAWORD CORPUS THIRD EDITION
========================================================

INTRODUCTION
------------

Arabic Gigaword Third Edition was produced by the Linguistic Data
Consortium (LDC); the catalog number is LDC2007T40 and the ISBN is
1-58563-460-3. This is a comprehensive archive of newswire text data
that has been acquired from Arabic news sources by the LDC at the
University of Pennsylvania. Arabic Gigaword Third Edition includes all
of the content of the second edition of Arabic Gigaword (LDC2006T02)
as well as new data.

Six distinct sources of Arabic newswire are represented here:

  - Agence France Presse   (afp_arb)
  - Assabah News Agency    (asb_arb)
  - Al Hayat News Agency   (hyt_arb)
  - An Nahar News Agency   (nhr_arb)
  - Ummah Press            (umh_arb)
  - Xinhua News Agency     (xin_arb)

The seven-character codes shown above represent both the directory
names where the data files are found and the 7-letter prefix that
appears at the beginning of every file name. Each 7-letter code
consists of the three-character source ID and the three-character
language code ("arb"), separated by an underscore ("_") character.

The six news services all use Modern Standard Arabic (MSA), so there
should be a fairly limited scope for orthographic and lexical
variation due to regional Arabic dialects. However, to the extent that
regional dialects might have an influence on MSA usage, the following
should be noted:

- An Nahar is based in Beirut, Lebanon, and it may be safe to assume
  that its material is created predominantly by speakers of Levantine
  Arabic.

- Al Hayat was originally a Lebanese news service as well, but it has
  been based in London during the entire period represented in this
  archive (and its owners are in Saudi Arabia, so it is sometimes
  referred to as a Saudi news service); even so, much of its
  reporting/editorial staff may be of Levantine origins.

- Assabah, which was not available in previous Gigaword releases, is
  based in Tunisia.
- The Xinhua and AFP services are obviously international in scope
  (Xinhua is based in Beijing, AFP in Paris), and we have no
  information about the regional distribution of Arabic reporters and
  editors for these services.

- The content provided by Ummah Press comes from diverse sources
  throughout the Arabic-speaking world.

DIFFERENCES IN RELEASE 3 RELATIVE TO THE PREVIOUS RELEASE
---------------------------------------------------------

-- Newly Added Data:

The following table shows the new data that appear for the first time
in the Third Edition.

  Source                 Date Span          Document count
  Agence France Presse   2005.01 - 2006.12  137815
  Assabah News Agency    2004.09 - 2006.12   15410  (new source)
  Al Hayat News Agency   2005.01 - 2006.12    8799  (no data for 2004)
  An Nahar News Agency   2005.01 - 2006.12  104950  (no data for 2004)
  Xinhua News Agency     2005.01 - 2006.12  135472

(No new data was added from Ummah; in general, this source produces
relatively small amounts of data per month.)

-- Corrections to Content:

The following problems, observed in previous releases of Arabic
Gigaword, have been rectified in this release:

- In most of the older AFP files (1994 - 2002), there were fairly
  frequent cases of very brief documents where the text content was
  not recognized as such; in these cases (involving over 15,000
  stories in the 8-year span), the TEXT element appeared empty while
  the HEADLINE element contained anywhere from three to several lines
  of text. We have tried to rearrange the content in these docs,
  leaving only the first line as the headline and moving the rest
  into the text segment. All stories of this sort had originally been
  classified as "other", and the classification has not been changed.

- Recent Al Hayat data (from 2002 and 2003) contained some
  Arabic-Indic digits, despite our intention to convert all digit
  strings to the ASCII digit characters for consistency. (See
  discussion of digits in the Character Encoding section below.)
- Some Al Hayat data had stray angle-bracket characters ("<" and
  ">"), which should have been rendered as "&lt;" and "&gt;". There
  were also some defective "Doc-ID" strings (the 'id' attribute in
  the "DOC" tag that begins each news story) in the January 2001
  data.

- Some An Nahar data had "bare" ampersand characters ("&"), which
  should have been rendered as "&amp;" character entities.

- Some Xinhua documents included empty sub-elements (HEADLINE,
  DATELINE and/or TEXT sections containing no data); when HEADLINE or
  DATELINE were empty, these tags were removed. When the TEXT segment
  was empty, the document as a whole was removed.

- In several Xinhua stories, the Doc-ID string, which is supposed to
  provide the year, month, date and sequence number for the story,
  had become garbled, yielding an incorrect or impossible date
  string. A separate data file in the "docs" directory, called
  "docid_changes.txt", lists the changes in document inventory and
  Doc-ID strings.

- Also in Xinhua, it was typical for stories to end with a formulaic
  Arabic string (meaning "end-of-story"), which should not have been
  included as part of the final paragraph in each story.

- In general, we applied consistent line-wrapping to make the overall
  text presentation consistent across all sources (and consistent
  with Gigaword releases in other languages). We also made sure that
  the markup pattern described below is applied consistently for all
  sources, without exception.

CHARACTER ENCODING
------------------

The original data archives received by the LDC used a variety of
different character encodings for Arabic:

- An Nahar archives up to and including 2003 were provided in
  MacArabic; the 2005 and 2006 archives added here were delivered as
  Microsoft Access database files, with Unicode-encoded Arabic.

- Assabah, Xinhua and Ummah used CP1256, and AFP used a 7-bit
  encoding called ASMO 449, which consisted of a subset of the Arabic
  letters supported in CP1256.
- Al Hayat archives up to and including 2001 were provided in CP1256,
  but subsequent material used Unicode.

To avoid the problems and confusion that could result from
differences in character-set specifications, all text files in this
corpus have been converted to the Unicode UTF-8 character encoding.
Owing to the use of UTF-8, the SGML tagging within each file
(described in detail in the next section) shows up as lines of
single-byte-per-character (ASCII) text, whereas lines of actual text
data, including article headlines and datelines, contain a mixture of
single-byte and multi-byte characters. In general, single-byte
characters in the text data will consist of white-space, digits and
punctuation marks, whereas multi-byte characters consist of Arabic
letters and a small number of special punctuation or other symbols.
This variable-width character encoding is intrinsic to UTF-8, and all
UTF-8-capable processes will handle the data appropriately.

The MacArabic encoding was designed to support ASCII digit characters
as well as the so-called Arabic-Indic digits, which have distinct
glyphs but are semantically equivalent to ASCII digits; Unicode also
provides these special digit characters (in fact, two versions of
them) in its Arabic code page. CP1256 and ASMO/ISO provide ASCII
digits only. In the An Nahar data, and in the more recent data from
Al Hayat, we found that both ASCII and Arabic-Indic digits were used,
but there seemed to be no rule or pattern to predict which set would
be used in a given instance. In the case of the older An Nahar
MacArabic, because of the character rendering assumptions that
underlie the MacArabic encoding, strings of Arabic-Indic digits are
presented in text files using "right-to-left display order", while
ASCII digit strings used logical order. Readers of Arabic always read
digit strings in a manner equivalent to readers of English and other
left-to-right languages -- i.e.
the most significant digit is always displayed left-most in the
string -- regardless of the glyphs being used for the digits. In
terms of ordering digit characters in a data stream, "logical order"
refers to having the most significant digit presented first in the
stream. In English and other left-to-right languages, "logical order"
is identical to "display order", but for Arabic, "logical order" is
the reverse of "right-to-left display order".

To minimize confusion and useless variability in the Gigaword text
files, we have converted all Arabic-Indic digits in An Nahar data to
their ASCII equivalents, and when these occurred in strings of 2 or
more digits, we have reversed the strings so that they are presented
in logical order in each file, to be consistent with the conventions
used in the other sources.

In the case of the more recent Al Hayat data, we found not only the
use of Arabic-Indic digits (which in this case used logical
ordering), but also a few instances where the Unicode "presentation
form" Arabic characters (in the code-point ranges U+FB50 through
U+FDFF and U+FE70 through U+FEFF) were being used in place of the
"normal" characters (in the code-point range U+0600 through U+06FF).
For this source, we again converted all digits to the ASCII range,
and also used standard Unicode normalization procedures to convert
the presentation-form letters to their "normal" forms.

The original AFP source data always used right-to-left display order
for digit strings -- this is because the service assumes the data are
being supplied mainly to printing devices that operate in a strict,
linear right-to-left fashion. All digit strings in the AFP files have
been reversed in the Gigaword release to yield logical ordering.
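The conversions described above can be sketched in a few lines. The
following is not the LDC's actual conversion code, just a minimal
illustration (the function names are hypothetical): it maps both
Unicode sets of Arabic-Indic digits to ASCII, optionally reverses
runs of two or more digits (as was needed for sources stored in
right-to-left display order), and uses standard NFKC compatibility
normalization to fold presentation-form letters back to the normal
U+0600 - U+06FF range.

```python
import re
import unicodedata

# Map both Unicode sets of Arabic-Indic digits to ASCII:
# U+0660-U+0669 (Arabic-Indic) and U+06F0-U+06F9 (Extended Arabic-Indic).
ARABIC_TO_ASCII = {0x0660 + i: ord("0") + i for i in range(10)}
ARABIC_TO_ASCII.update({0x06F0 + i: ord("0") + i for i in range(10)})

def normalize_digits(text, reverse_runs=False):
    """Convert Arabic-Indic digits to ASCII digits. With reverse_runs,
    also reverse runs of two or more digits -- needed for sources that
    stored digit strings in right-to-left display order (the older
    An Nahar MacArabic data and the AFP feed)."""
    text = text.translate(ARABIC_TO_ASCII)
    if reverse_runs:
        text = re.sub(r"\d{2,}", lambda m: m.group(0)[::-1], text)
    return text

def fold_presentation_forms(text):
    """NFKC compatibility normalization folds the Arabic presentation-form
    code points (U+FB50-U+FDFF, U+FE70-U+FEFF) back to the normal
    U+0600-U+06FF letters."""
    return unicodedata.normalize("NFKC", text)
```

For example, a display-order run of Arabic-Indic digits spelling
"6-0-0-2" comes out as the logical-order ASCII string "2006" when
reverse_runs is enabled.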
DATA FORMAT AND SGML MARKUP
---------------------------

Each data file name consists of the 7-letter prefix, an underscore
character, and a 6-digit date (representing the year and month during
which the file contents were generated by the respective news
source), followed by a ".gz" file extension, indicating that the file
contents have been compressed using the GNU "gzip" compression
utility (RFC 1952). So, each file contains all the usable data
received by the LDC for the given month from the given news source.

All text data are presented in SGML form, using a very simple,
minimal markup structure. The file "gigaword_a.dtd" in the "dtd"
directory provides the formal "Document Type Declaration" for parsing
the SGML content. The corpus has been fully validated by a standard
SGML parser utility (nsgmls), using this DTD file.

The markup structure, common to all data files, can be summarized as
follows:

  <DOC id="XXX_ARB_YYYYMMDD.NNNN" type="story">
  <HEADLINE>
  The Headline Element is Optional -- not all DOCs have one
  </HEADLINE>
  <DATELINE>
  The Dateline Element is Optional -- not all DOCs have one
  </DATELINE>
  <TEXT>
  <P>
  Paragraph tags are only used if the 'type' attribute of the DOC
  happens to be "story" -- more on the 'type' attribute below...
  </P>
  <P>
  Note that all data files use the UNIX-standard "\n" form of line
  termination, and text lines are generally wrapped to a width of 80
  characters or less.
  </P>
  </TEXT>
  </DOC>
For every "opening" tag (DOC, HEADLINE, DATELINE, TEXT, P), there is
a corresponding "closing" tag -- always. The attribute values in the
DOC tag are always presented within double-quotes; the "id="
attribute of DOC consists of the 7-letter source abbreviation (in
CAPS), an underscore character, an 8-digit date string representing
the date of the story (YYYYMMDD), a period, and a 4-digit sequence
number starting at "0001" for each date (e.g.
"XIN_ARB_20010101.0001"); in this way, every DOC in the corpus is
uniquely identifiable by the id string.

Every SGML tag is presented alone on one line, separate from other
tags and from the text content (so a simple process like the UNIX
"grep -v '<'" will eliminate all tags and retain all the text
content).

In general, these files are not intended for use with interactive
text editors or word processing software (though many such programs
are likely to work reasonably well with these files). Rather, it's
expected that the files will be used as input to programs that are
geared to dealing with data in such quantities, for filtering,
conditioning, indexing, statistical summary, etc. (The LDC can
provide open-source software, mostly written in Perl, for extracting
DOCs from such data files, using the "id" string or other search
criteria for story selection; see http://www.ldc.upenn.edu/Using/ .)

- Earlier corpora tended to use different markup outlines (different
  tag sets) depending on the source of the data, because different
  sources came to us with different structural properties, and we had
  chosen to preserve these as much as possible (even though many
  elements of the delivered structure may have been meaningless for
  research use). The present corpus uses only the information
  structure that is common to all sources and serves a clear
  function: headline, dateline, and core news content (usually
  containing paragraphs).
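Because every tag sits alone on its own line, DOC units can be pulled
out with simple line-oriented code and no full SGML parser. The sketch
below (hypothetical helpers, not the LDC's Perl tools; it assumes the
attribute order id-then-type shown in the DOC tag) streams DOCs from
one compressed monthly file and decodes the three entity references
used in the corpus:

```python
import gzip
import re

# Opening DOC tag, assuming the id/type attribute order used in the corpus.
DOC_OPEN = re.compile(r'<DOC id="([^"]+)" type="([^"]+)"')

def iter_docs(path):
    """Yield (doc_id, doc_type, text_lines) for each DOC in one
    monthly .gz data file, relying on the one-tag-per-line layout."""
    doc_id = doc_type = None
    lines = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for raw in f:
            line = raw.rstrip("\n")
            m = DOC_OPEN.match(line)
            if m:
                doc_id, doc_type = m.groups()
                lines = []
            elif line == "</DOC>":
                yield doc_id, doc_type, lines
            elif not line.startswith("<"):   # skip the remaining tags
                lines.append(line)

def unescape(text):
    """Decode the corpus's three entity references. "&amp;" is
    replaced last, so a literal "&amp;lt;" in the data decodes to the
    text "&lt;" rather than "<"."""
    for ent, ch in (("&lt;", "<"), ("&gt;", ">"), ("&amp;", "&")):
        text = text.replace(ent, ch)
    return text
```

Skipping every line that begins with "<" is safe here precisely
because literal angle brackets in the text are always escaped as
entity references.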
The "dateline" is a brief string typically found at the beginning of
the first paragraph in each news story, giving the location the
report is coming from, and sometimes the news service and/or date;
since this content is not part of the initial sentence, we separate
it from the first paragraph (this was not done in previous corpora).

- Earlier corpora tended to include "custom" SGML entity references,
  which were intended to preserve things like special punctuation or
  typesetting instructions (e.g. "&QL;", "&UR;", "&MD;", etc). The
  present corpus uses only three SGML entity references:

    - "&amp;"  represents the literal ampersand character "&"
    - "&lt;"   represents the literal open-angle bracket "<"
    - "&gt;"   represents the literal close-angle bracket ">"

  All other specialized control characters have been filtered out.

- In earlier corpora, newswire data were presented as streams of
  undifferentiated "DOC" units; depending on the source and corpus,
  varying amounts of quality checking and filtering were done to
  eliminate noisy or unsuitable content (e.g. test messages). For
  this release, all sources have received a uniform treatment in
  terms of quality control, and we have applied a rudimentary (and
  _approximate_) categorization of DOC units into three distinct
  "types". The classification is indicated by the `` type="string" ''
  attribute that is included in each opening ``DOC'' tag. The three
  types are:

  * story : This is by far the most frequent type, and it represents
    the most typical newswire item: a coherent report on a particular
    topic or event, consisting of paragraphs and full sentences. As
    indicated above, the paragraph tag "<P>" is found only in DOCs of
    this type; in the other types described below, the text content
    is rendered with no additional tags or special characters -- just
    lines of tokens separated by whitespace.

  * multi : This type of DOC contains a series of unrelated "blurbs",
    each of which briefly describes a particular topic or event; this
    is typically applied to DOCs that contain "summaries of today's
    news", "news briefs in ... (some general area like finance or
    sports)", and so on. Each paragraph-like blurb by itself is
    coherent, but it does not bear any necessary relation of
    topicality or continuity relative to its neighbors.

  * other : This represents DOCs that clearly do not fall into any of
    the above types -- in general, items of this type are intended
    for broad circulation (they are not advisories), they may be
    topically coherent (unlike "multi" type DOCs), and they typically
    do not contain paragraphs or sentences (they aren't really
    "stories"); these are things like lists of sports scores, stock
    prices, temperatures around the world, and so on.

The general strategy for categorizing DOCs into these classes was,
for each source, to discover the most common and frequent clues in
the text stream that correlated with the "non-story" types, and to
apply the appropriate label for the ``type=...'' attribute whenever
the DOC displayed one of these specific clues. When none of the known
clues was in evidence, the DOC was classified as a "story". This
means that the most frequent classification error will tend to be the
use of `` type="story" '' on DOCs that are actually some other type.
But the number of such errors should be fairly small, compared to the
number of "non-story" DOCs that are correctly tagged as such.

Other "Gigaword" corpora (in English and Chinese) had a fourth
category, "advis" (for "advisory"), which applied to DOCs that
contain text intended solely for news service editors, not the
news-reading public.
In preparing the Arabic data, the task of determining patterns for
assigning "non-story" type labels was carried out by a native speaker
of Arabic, and (for whatever reason) this person did not find the
"advis" category to be applicable to any of the data.

Note that the markup was applied algorithmically, using logic that
was based on less-than-complete knowledge of the data. For the most
part, the HEADLINE, DATELINE and TEXT tags have their intended
content; but due to the inherent variability (and the inevitable
source errors) in the data, users may find occasional mishaps where
the headline and/or dateline were not successfully identified (hence
show up within TEXT), or where an initial sentence or paragraph has
been mistakenly tagged as the headline or dateline.

DATA QUANTITIES
---------------

The "docs" directory contains a set of plain-text tables
(datastats_*) that describe the quantities of data by source and
month (i.e. by file), broken down according to the three "type"
categories. The overall totals for each source are summarized below.
Note that the "Totl-MB" numbers show the amount of data you get when
the files are uncompressed (i.e. over 6.5 gigabytes, total); the
"Gzip-MB" column shows totals for compressed file sizes as stored on
the DVD-ROM; the "K-wrds" numbers are simply the number of
space-separated tokens in the text, excluding SGML tags.
  Source    #Files  Gzip-MB  Totl-MB   K-wrds    #DOCs
  afp_arb      152      441     1806   147612   798436
  asb_arb       28       23       77     6587    15410
  hyt_arb      142      559     1932   171502   378353
  nhr_arb      134      612     2172   193732   449340
  umh_arb       24        4       14     1201     4645
  xin_arb       67      171      672    56165   348551
  TOTAL        547     1810     6673   576799  1994735

The following tables present "K-wrds" and "#DOCs" broken down by
source and DOC type:

  type="multi":      #DOCs   K-wrds   TextKB
    afp_arb          15016     4016    48577
    asb_arb              0        0        0
    hyt_arb           2875     1942    20655
    nhr_arb           5807     1898    21227
    umh_arb              0        0        0
    xin_arb           9635     2232    26022
    TOTAL            33333    10088   116481

  type="other":      #DOCs   K-wrds   TextKB
    afp_arb          86948     5679    71316
    asb_arb           1576      401     6742
    hyt_arb           2814     1241    14122
    nhr_arb           5923     3221    38398
    umh_arb              0        0        0
    xin_arb           3346      180     2280
    TOTAL           100607    10722   132858

  type="story":      #DOCs   K-wrds   TextKB
    afp_arb         696472   137917  1553438
    asb_arb          13834     6186    68222
    hyt_arb         372664   168319  1860879
    nhr_arb         437610   188613  2082516
    umh_arb           4645     1201    13163
    xin_arb         335570    53753   604293
    TOTAL          1860795   555989  6182511

GENERAL PROPERTIES OF THE DATA
------------------------------

The AFP Arabic archive was received at the LDC via a continuous data
feed over a dedicated satellite dish and modem, spooling into daily
files on a main server computer. At various times throughout the
multi-year collection period, there were intermittent problems with
the equipment or the signal reception, yielding "noise" and abrupt
interruptions in the data stream. We have taken a range of steps to
eliminate fragmentary and noisy data from the collection in preparing
this release. Through UTF-8 conversion and SGML validation, we can at
least be sure that the data contain only the appropriate characters
and that all the markup is well formed. It is still possible that a
handful of stories contain undetected "transients", e.g.
cases where the server shut down for an indeterminate period and then
restarted, leaving no detectable evidence in the data that was
spooling onto disk, resulting in one "news story" that actually
contains parts of two unrelated stories (but server interruptions
were relatively infrequent, and would usually leave evidence). Also,
some patterns of character corruption may have gone undetected, if
they happened to consist entirely of "valid" character data (despite
being nonsensical to a human reader); based on the results of our
quality-control passes over these files, there may be a higher
likelihood of undetected text corruption in the period between June
1, 2001 and September 30, 2002.

For Assabah, the LDC received an archive of web content covering the
period of Sep. 2004 through Nov. 2006, and as of the latter date, we
have been maintaining a steady download of content on a daily basis.

The An Nahar data were produced from bulk archives delivered to the
LDC on CD-ROM. Content before 2004 was in MacArabic, while content
for 2005 and 2006 was extracted from Microsoft Access database files,
in the form of a single HTML stream for each year's archive. The
Arabic character content was rendered as numeric Unicode character
entities, and these were converted to UTF-8 for publication by the
LDC.

Al Hayat data prior to 2004 were also produced from bulk CD-ROM
archives; the LDC has yet to acquire similar archives for the period
from January 2004 through October 2006. However, we were able to
obtain relatively small portions of the 2005 and 2006 archives via
web download. Starting in November 2006, we have been harvesting the
full content of Al Hayat via daily web download, and this change in
collection is reflected in the last two monthly files (hyt_arb_200611
and hyt_arb_200612) in the present release, which are comparable in
size to the pre-2004 files.
Most of the Xinhua Arabic archive was delivered in bulk via internet
transfer (FTP), and the LDC has been maintaining a steady download of
all content on a daily basis.

The Ummah texts were delivered via email transmission, and include
English translations for each of the stories delivered. (The English
content is not provided here.) Because of the low overall volume of
data received from this source, combined with significant variability
in their delivery methods and format, it was decided that the overall
benefit of providing new content from this source would not warrant
the effort required to normalize the material.

While all sources other than AFP have been received via internet
transfers of one sort or another, and have therefore avoided many of
the problems that afflict transmission through a serial modem, these
archives still contained noticeable amounts of "noise" (unusable
characters, null bytes, etc.) which had to be filtered out for
research use. To some extent, this is an open-ended problem, and
there may be kinds of error conditions that have gone unnoticed or
untreated -- this is true of any large text collection -- but we have
striven to ensure that the characters presented in all files are in
fact valid and displayable, and that the markup is fully compliant
relative to the DTD provided here.

DUPLICATE DOCUMENT INFORMATION
------------------------------

Some newswire sources may distribute stories that are fully or
partially identical. We have not attempted to eliminate these
duplications; however, we plan to make information about duplicate
and similar articles available on our web site as supplemental
information for this corpus.

ADDITIONAL INFORMATION AND UPDATES
----------------------------------

Additional information, updates, and bug fixes may be available in
the LDC catalog entry for this corpus (LDC2007T40) at:

  http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2007T40

David Graff
Linguistic Data Consortium
November 2007