Arabic Gigaword Second Edition

Item Name: Arabic Gigaword Second Edition
Author(s): David Graff, Ke Chen, Junbo Kong, Kazuaki Maeda
LDC Catalog No.: LDC2006T02
ISBN: 1-58563-371-2
ISLRN: 299-814-033-635-4
Release Date: January 19, 2006
Member Year(s): 2006
DCMI Type(s): Text
Data Source(s): newswire
Application(s): information retrieval, language modeling, natural language processing
Language(s): Standard Arabic
Language ID(s): arb
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2006T02 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Graff, David, et al. Arabic Gigaword Second Edition LDC2006T02. Web Download. Philadelphia: Linguistic Data Consortium, 2006.


Arabic Gigaword Second Edition was produced by the Linguistic Data Consortium (LDC), catalog number LDC2006T02 and ISBN 1-58563-371-2. It is a comprehensive archive of newswire text data acquired from Arabic news sources by LDC at the University of Pennsylvania.

Arabic Gigaword Second Edition includes all of the content of the first edition of Arabic Gigaword (LDC2003T12) as well as new data.

Five distinct sources of Arabic newswire are represented here:

Agence France Presse (afp_arb; formerly afa)
Al Hayat News Agency (hyt_arb; formerly alh)
An Nahar News Agency (nhr_arb; formerly ann)
Ummah Press (umh_arb)
Xinhua News Agency (xin_arb; formerly xia)

The seven-character codes in parentheses above consist of a three-character source ID and the three-character language code ("arb"), separated by an underscore ("_") character. The language code "arb" denotes Standard Arabic in the ISO 639-3 standard. In the first edition of the Arabic Gigaword corpus, a simpler three-character code identified both the source and the language. The new convention distinguishes data sets by source and language more naturally when a single newswire provider distributes data in multiple languages.

Ummah Press is a new source added to the Second Edition. The following table shows the new data that appear for the first time in the Second Edition.

Agence France Presse   2003.01-2004.12   143,766 documents
Al Hayat News Agency   2002.01-2003.12    64,308 documents
An Nahar News Agency   2003.01-2004.01    16,316 documents
Ummah Press            2003.01-2004.12     4,641 documents
Xinhua News Agency     2003.06-2004.12   106,236 documents


The table below summarizes the corpus by source. #Files is the number of files per source; Gzip-MB is the total compressed file size in megabytes; Totl-MB is the total uncompressed file size in megabytes (approximately 5.3 gigabytes overall); K-wrds is the number of space-separated tokens in the text, in thousands, excluding SGML tags; and #DOCs is the number of documents.

Source    #Files  Gzip-MB  Totl-MB  K-wrds    #DOCs
AFP_ARB      128      355     1429  123594   660621
HYT_ARB      119      524     1861  169100   369555
NHR_ARB      109      457     1649  151078   344084
UMH_ARB       24        4       13    1201     4645
XIN_ARB       43      103      407   36933   213082
TOTAL        423     1443     5359  481906  1591987
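
As a quick consistency check, the TOTAL row can be recomputed from the per-source rows. The following is a minimal sketch in Python; the names ROWS, STATED_TOTAL, and column_totals are chosen here for illustration and are not part of the corpus documentation.

```python
# Columns: #Files, Gzip-MB, Totl-MB, K-wrds, #DOCs (values copied from the table).
ROWS = {
    "AFP_ARB": (128, 355, 1429, 123594, 660621),
    "HYT_ARB": (119, 524, 1861, 169100, 369555),
    "NHR_ARB": (109, 457, 1649, 151078, 344084),
    "UMH_ARB": (24, 4, 13, 1201, 4645),
    "XIN_ARB": (43, 103, 407, 36933, 213082),
}
STATED_TOTAL = (423, 1443, 5359, 481906, 1591987)


def column_totals(rows):
    """Sum each column across all sources."""
    return tuple(sum(column) for column in zip(*rows.values()))


computed = column_totals(ROWS)
print("totals match the TOTAL row:", computed == STATED_TOTAL)

# Ratio of uncompressed to compressed size over the whole corpus.
ratio = STATED_TOTAL[2] / STATED_TOTAL[1]
print(f"overall compression ratio: {ratio:.2f}")
```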

All text files in this corpus have been converted to UTF-8 character encoding.

Owing to the use of UTF-8, the SGML tagging within each file shows up as lines of single-byte-per-character (ASCII) text, whereas lines of actual text data, including article headlines and datelines, contain a mixture of single-byte and multi-byte characters. In general, single-byte characters in the text data will consist of digits and punctuation marks (where the original source relied on ASCII punctuation codes, rather than Arabic-specific punctuation), whereas multi-byte characters consist of Arabic letters and a small number of special punctuation or other symbols. This variable-width character encoding is intrinsic to UTF-8, and all UTF-8 capable processes will handle the data appropriately.
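
The mix of single-byte and multi-byte characters described above can be illustrated with a short Python sketch. The function name utf8_width_counts and the sample strings are made up for illustration; they are not taken from the corpus.

```python
def utf8_width_counts(text: str) -> tuple[int, int]:
    """Return (single_byte_chars, multi_byte_chars) for the UTF-8 encoding of text."""
    single = multi = 0
    for ch in text:
        if len(ch.encode("utf-8")) == 1:
            single += 1  # ASCII: markup, digits, ASCII punctuation, spaces
        else:
            multi += 1   # e.g. Arabic letters, which take two bytes each in UTF-8
    return single, multi


# An SGML-style tag is pure ASCII, so every character is one byte wide.
print(utf8_width_counts("<P>"))

# Hypothetical text line: three Arabic letters, a space, and four ASCII digits.
sample = "\u0645\u0635\u0631 2004"
print(utf8_width_counts(sample))
print(len(sample), "characters,", len(sample.encode("utf-8")), "bytes")
```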

Each data file name consists of the seven-character prefix, an underscore character ("_"), and a six-digit date (representing the year and month during which the file contents were generated by the respective news source), followed by a ".gz" file extension, indicating that the file contents have been compressed using the GNU "gzip" compression utility (RFC 1952). Each file thus contains all the usable data received by LDC for the given month from the given news source.
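
The naming convention above can be sketched as a small parser in Python. The example name "afp_arb_200301.gz" is constructed from the description rather than copied from a file listing, and the names FILE_NAME, CorpusFile, and parse_file_name are hypothetical.

```python
import re
from typing import NamedTuple

# source ID, "_", language code, "_", YYYYMM, ".gz"
FILE_NAME = re.compile(r"^([a-z]{3})_([a-z]{3})_(\d{4})(\d{2})\.gz$")


class CorpusFile(NamedTuple):
    source: str
    language: str
    year: int
    month: int


def parse_file_name(name: str) -> CorpusFile:
    """Split a data file name into source, language, year, and month."""
    match = FILE_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    source, language, year, month = match.groups()
    month_number = int(month)
    if not 1 <= month_number <= 12:
        raise ValueError(f"invalid month in file name: {name!r}")
    return CorpusFile(source, language, int(year), month_number)


print(parse_file_name("afp_arb_200301.gz"))
```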

All text data are presented in SGML form, using a very simple, minimal markup structure. The file gigaword_a.dtd in the "dtd" directory provides the formal "Document Type Declaration" for parsing the SGML content. The corpus has been fully validated by a standard SGML parser utility (nsgmls), using this DTD file.

Unlike older corpora, the present corpus uses only the information structure that is common to all sources and serves a clear function: headline, dateline, and core news content (usually containing paragraphs).

All sources have received uniform treatment in terms of quality control, and their DOCs have been categorized into three distinct "types":

story: this type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences
multi: this type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event: "summaries of today's news," "news briefs in ... (some general area like finance or sports)" and so on
other: these DOCs clearly do not fall into any of the above types; these are things like lists of sports scores, stock prices, temperatures around the world, and so on

The general strategy for categorizing DOCs into these three classes was, for each source, to discover the most common and frequent clues in the text stream that correlated with the "non-story" types. When none of the known clues was in evidence, the DOC was classified as a "story."
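
To tally DOCs by type in a single file, something like the following Python sketch may serve as a starting point. It assumes that each DOC opening tag sits on one line and carries the label in a type="..." attribute; the exact markup is defined by gigaword_a.dtd and should be checked there. The sample lines and DOC IDs below are hypothetical, as are the function names.

```python
import gzip
import re
from collections import Counter

# Assumption: the opening tag looks like <DOC id="..." type="story">.
DOC_TYPE = re.compile(r'<DOC\b[^>]*\btype="([^"]+)"')


def count_doc_types(lines) -> Counter:
    """Count DOC opening tags by the value of their type attribute."""
    counts = Counter()
    for line in lines:
        match = DOC_TYPE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_doc_types_in_file(path: str) -> Counter:
    """Read a gzip-compressed corpus file as UTF-8 text and count DOC types."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_doc_types(handle)


# Hypothetical sample lines, not copied from the corpus.
sample_lines = [
    '<DOC id="AFP_ARB_20030101.0001" type="story">',
    "<HEADLINE>",
    '<DOC id="AFP_ARB_20030101.0002" type="multi">',
    '<DOC id="AFP_ARB_20030101.0003" type="story">',
]
print(count_doc_types(sample_lines))
```

A regular expression is used here only to keep the sketch short; a validating SGML parser with the supplied DTD is the more robust route.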

Other "Gigaword" corpora (in English and Chinese) had a fourth category, "advis" (for "advisory"), which applied to DOCs that contain text intended solely for news service editors, not the news-reading public. In preparing the Arabic data, the task of determining patterns for assigning "non-story" type labels was carried out by a native speaker of Arabic, and (for whatever reason) this person did not find the "advis" category to be applicable to any of the data.

As described in the introduction, the Second Edition uses a new naming scheme for file names and document IDs. Every document in the first edition of the Arabic Gigaword corpus can be mapped to the same document in this edition by changing the prefix of its DOC ID and file name as shown below. Upper-case letters are used for DOC IDs; lower-case letters are used for file and directory names. The underscore character that connects the seven-character prefix to the date is included in the following table.

Old   New
AFA   AFP_ARB_
ALH   HYT_ARB_
ANN   NHR_ARB_
XIA   XIN_ARB_
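
The prefix change can be sketched in Python as follows. The mapping is taken from the source list at the top of this description; the shape of the first-edition names used in the examples (a three-letter prefix followed directly by the date) is an assumption, and OLD_TO_NEW and convert_prefix are hypothetical names.

```python
# Old three-letter prefix -> new prefix, including the connecting underscore.
OLD_TO_NEW = {
    "AFA": "AFP_ARB_",
    "ALH": "HYT_ARB_",
    "ANN": "NHR_ARB_",
    "XIA": "XIN_ARB_",
}


def convert_prefix(name: str) -> str:
    """Rewrite a first-edition DOC ID or file name with the new prefix.

    Upper-case input (DOC IDs) gets an upper-case prefix; anything else
    (file and directory names) gets a lower-case prefix.
    """
    head, rest = name[:3], name[3:]
    new_prefix = OLD_TO_NEW.get(head.upper())
    if new_prefix is None:
        raise ValueError(f"unknown first-edition prefix: {head!r}")
    if not head.isupper():
        new_prefix = new_prefix.lower()
    return new_prefix + rest


# Hypothetical first-edition names, assumed to have this shape.
print(convert_prefix("AFA20020101.0001"))
print(convert_prefix("xia200201.gz"))
```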


For an example of the data in this corpus, please examine this screenshot, which shows the text from a single file.
